
What's under the hood of LLMs?

January 2024

The LLM stack has 4 layers:

  1. Data layer
    1. RAG (retrieval-augmented generation) lets LLMs ground predictions in a corpus of input data, increasing contextual awareness and reducing hallucinations
      1. ELT & featurization
        1. Featurize data to increase search and retrieval effectiveness. A prevalent approach is to break the data into chunks and transform them into vectors for semantic search.
        2. Companies working in this area include: Superlinked, Unstructured, Waveline
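The chunking step above can be sketched in a few lines. This is a minimal sketch using fixed-size character chunks with overlap; real pipelines usually split on tokens or sentences, and the `chunk_size`/`overlap` values here are arbitrary:

```python
# Minimal sketch of fixed-size chunking with overlap (hypothetical
# parameters; production pipelines typically split on tokens/sentences).
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping character chunks for embedding."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece:
            chunks.append(piece)
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap ensures that a sentence falling on a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated storage.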
      2. Storage
        1. Vector databases store these vectors for semantic search
        2. Companies working in this area include: Pinecone, Zilliz, Weaviate, Chroma, Featureform, LanceDB, pgvector
      3. Search & Retrieval
        1. During inference, information must be retrieved and fed into the model at high speed. Some retrieval methods include:
          • Simplistic – vector similarity search on naive text embeddings
          • Complex – hand-crafted rules-based systems (e.g. Github Copilot, well described in this blog post)
        2. Companies working in this area include: Nomic, Aryn, Metal, Vald, MyScale
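A toy version of the "simplistic" retrieval path above: brute-force cosine similarity over embeddings. The `embed` function here is a stand-in (a character-frequency counter, not a real embedding model), so treat this as a sketch of the mechanics only:

```python
import math

# embed() is a placeholder for a real embedding model; it just counts
# letter frequencies so the example is self-contained and runnable.
def embed(text):
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord('a')] += 1.0
    return vec

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def search(query, corpus, top_k=2):
    """Rank corpus documents by similarity to the query embedding."""
    q = embed(query)
    scored = sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)
    return scored[:top_k]
```

Vector databases replace this linear scan with approximate nearest-neighbor indexes so retrieval stays fast at millions of vectors.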
    2. Companies working on end-to-end data layer include: LlamaIndex, Activeloop, Inkeep, Baseplate, Vespa
  2. Model layer
    1. There are 2 future emerging paths:
      1. Large foundation model dominates
      2. Fine-tuned models for each specific use case
    2. There are 4 components in the model layer:
      1. Core model
        1. Train the model from scratch (reportedly ~$100M spent on GPT-4) or fine-tune a pre-trained model
        2. Companies working in this area include: Google, OpenAI, Anthropic, Adept, Imbue, Cohere, Hugging Face, Stability AI, Mistral, Contextual, DeepInfra
      2. Serving & Computing
        1. Pricing
          1. GPT-3.5 costs ~$0.01-0.03 per query vs. ~$3 for GPT-4
          2. Self-hosting is even more costly - an Nvidia A100 costs $4-40 per hour to run on GCP
        2. Latency
          1. batch queries, cache results, optimize memory to reduce fragmentation, speculative decoding
        3. Companies working in this area include: Modular, Lambda Labs, Union, Exafunction, Modal, HippoML, Banana, SkyPilot, Texel, Paperspace, Foundry, Goose
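One of the latency levers listed above, caching results, can be sketched with a memoized wrapper around the model call. `fake_llm` below is a placeholder for a real model endpoint, included only so the example runs:

```python
import functools

# Track how often the "model" actually runs, to show the cache working.
CALL_COUNT = {"n": 0}

def fake_llm(prompt):
    """Stand-in for a real model endpoint (assumption for illustration)."""
    CALL_COUNT["n"] += 1
    return prompt.upper()

@functools.lru_cache(maxsize=1024)
def cached_completion(prompt):
    # Identical prompts hit the in-memory cache instead of the model.
    return fake_llm(prompt)
```

Real serving stacks cache at several levels (exact-match, semantic-similarity, and KV-cache reuse inside the model), but the exact-match case above already eliminates repeated work for identical requests.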
      3. Model routing / abstraction
        1. an abstraction layer that decides which model should fulfill the user request
        2. Companies working in this area include: Martian, LiteLLM, NotDiamond
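A hypothetical rule-based router along these lines: cheap model for short or simple requests, stronger model otherwise. The model names and heuristics are placeholders; production routers often use a learned classifier instead of hand-written rules:

```python
# Illustrative router: the model names and "hardness" markers below are
# assumptions for the sketch, not real API identifiers.
def route(prompt, cheap="small-model", strong="large-model"):
    """Pick a model name based on crude prompt-complexity heuristics."""
    hard_markers = ("prove", "derive", "step by step")
    if len(prompt) > 500 or any(m in prompt.lower() for m in hard_markers):
        return strong
    return cheap
```

The appeal of a routing layer is cost: if most traffic is simple, most queries can be served at the cheap model's price without degrading quality on hard ones.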
      4. Fine-tuning & Optimization
        1. Fine-tuning is an evolving process with 2 components:
          1. Data ops / curation: improve data quality through modification or augmentation (e.g., with synthetic data)
          2. Fine-tuning ops: orchestrate the fine-tuning runs through which model weights are updated
        2. Companies working in this area include: Manual labeling: Scale, Appen, Hive, Labelbox, Surge; Programmatic labeling: Snorkel, Watchful, Lilac; Fine-tuning ops/optimization: Arcee.ai, Lamini, Predibase, Together, Watchful, Superintel, Thirdai, GenJet, Glaive, LMFlow, Nolano
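For illustration, a supervised fine-tuning example is typically a prompt/completion pair serialized as one JSON object per line (JSONL). The exact schema varies by provider, so the field names below are an assumption:

```python
import json

# Illustrative shape of one fine-tuning record; "prompt"/"completion"
# are assumed field names - check your provider's schema.
def make_record(prompt, completion):
    """Serialize one training example as a JSONL line."""
    return json.dumps({"prompt": prompt, "completion": completion})
```

A training file is then just one such line per curated example, which is where the data ops / curation work above pays off: the model only gets as good as these pairs.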
  3. Deployment layer
    1. Security & Governance
      1. at the data layer, restrict the model to data the user is authorized to access
      2. at the model layer, a supervisor/firewall can try to identify and block malicious prompts and outputs (but only with limited effectiveness)
      3. Companies working in this area include: Hiddenlayer, Lakera, Preamble, Vera, Credal, Fortify, Guardrail, Harmonic, Cadea, Laiyer
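A minimal sketch of the model-layer firewall idea: block prompts matching known injection patterns. The pattern list is an assumption for illustration, and, as noted above, pattern matching alone is only partially effective:

```python
import re

# Illustrative blocklist - real guardrail products combine patterns with
# classifiers; these two regexes are assumptions for the sketch.
BLOCKLIST = [
    re.compile(r"ignore (all )?previous instructions", re.I),
    re.compile(r"reveal .*system prompt", re.I),
]

def is_allowed(prompt):
    """Return False if the prompt matches a known injection pattern."""
    return not any(p.search(prompt) for p in BLOCKLIST)
```

The weakness is obvious from the code: any phrasing not on the blocklist passes, which is why the text above hedges on effectiveness.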
    2. Observability & Evaluation
      1. Monitor for performance degradation, system outages, and security breaches
      2. Companies working in this area include: Arthur, Arize, Fiddler, Latticeflow, Ventrilo, Gentrace, Katanemo, Helicone, Langfuse, Uptrain, Honeyhive, Whylabs
    3. Product analytics
      1. analyze whether the model fulfilled the user request, along with user behavior
      2. Companies working in this area include: Context.ai, Freeplay, Langfuse
    4. Orchestration & LLMOps
      1. Coordinate multi-step processes (e.g., API calls, state management)
      2. Companies working in this area include: OctoML, Weights & Biases, Langchain, LlamaIndex, Comet, Replicate, MindsDB, Ikigai Labs, MosaicML, Qwak, Outerbounds, Continual, Fixie, Griptape, Graft, BentoML, Steamship, Vellum, Humanloop, Patterns, Mendable, Konko AI, Stack, Dify, Nomos, Log10, Relevance AI, Klu, GradientJ, Pyqai, Autoblocks, Superintel, Pullflow; No-code: Swai AI, Lastmile.ai, Retune, Respell; Enterprise/GUI-based: Glean, Yurts, Dust
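The multi-step coordination described above can be sketched as a pipeline that threads state through ordered step functions. The step names and state keys here are illustrative, not any particular framework's API:

```python
# Minimal orchestration sketch: each step is a plain function that takes
# the shared state dict and returns it updated.
def run_pipeline(state, steps):
    for step in steps:
        state = step(state)
    return state

def retrieve(state):
    """Hypothetical retrieval step: attach context for the query."""
    state["context"] = f"docs about {state['query']}"
    return state

def build_prompt(state):
    """Assemble the final prompt from retrieved context + query."""
    state["prompt"] = f"{state['context']}\nQ: {state['query']}"
    return state
```

Orchestration frameworks add what this sketch omits: retries, branching, tracing of each step, and persistence of intermediate state.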
  4. Interface layer
    1. 3rd party API interoperability
      1. Companies working in this area include: Anon, Recall.ai, Bruinen, Induced AI, Reworkd
    2. User interface - what's beyond chat?
      1. Companies working in this area include: xPrompt, Inkeep