
What's under the hood of LLMs?

January 2024

The LLM stack has 4 layers:

  1. Data layer
    1. RAG (retrieval-augmented generation) lets LLMs ground predictions in a corpus of input data, increasing contextual awareness and reducing hallucinations
      1. ELT & featurization
        1. Featurize data to increase search and retrieval effectiveness. A prevalent approach is to break the data into chunks and transform them into vectors for semantic search.
        2. Companies working in this area include: Superlinked, Unstructured, Waveline
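The chunking step above can be sketched in a few lines. This is a minimal sketch using fixed-size character chunks with overlap; real pipelines usually split on tokens or sentences, and the `chunk_size`/`overlap` values here are arbitrary:

```python
# Minimal sketch of fixed-size chunking with overlap (hypothetical
# parameters; production pipelines typically split on tokens/sentences).
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping character chunks for embedding."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece:
            chunks.append(piece)
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap ensures that a sentence falling on a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated storage.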
      2. Storage
        1. Vector databases store these vectors for semantic search
        2. Companies working in this area include: Pinecone, Zilliz, Weaviate, Chroma, Featureform, LanceDB, pgvector
      3. Search & Retrieval
        1. During inference, information must be retrieved and fed into the model at high speed. Some retrieval methods include:
          • Simplistic – vector similarity search on naive text embeddings
          • Complex – hand-crafted rules-based systems (e.g. Github Copilot, well described in this blog post)
        2. Companies working in this area include: Nomic, Aryn, Metal, Vald, MyScale
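A toy version of the "simplistic" retrieval path above: brute-force cosine similarity over embeddings. The `embed` function here is a stand-in (a character-frequency counter, not a real embedding model), so treat this as a sketch of the mechanics only:

```python
import math

# embed() is a placeholder for a real embedding model; it just counts
# letter frequencies so the example is self-contained and runnable.
def embed(text):
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord('a')] += 1.0
    return vec

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def search(query, corpus, top_k=2):
    """Rank corpus documents by similarity to the query embedding."""
    q = embed(query)
    scored = sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)
    return scored[:top_k]
```

Vector databases replace this linear scan with approximate nearest-neighbor indexes so retrieval stays fast at millions of vectors.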
    2. Companies working on end-to-end data layer include: LlamaIndex, Activeloop, Inkeep, Baseplate, Vespa
  2. Model layer
    1. There are 2 future emerging paths:
      1. Large foundation model dominates
      2. Fine-tuned models for each specific use case
    2. There are 4 components in the model layer:
      1. Core model
        1. Train the model from scratch (reportedly ~$100M spent on GPT-4) or fine-tune a pre-trained model
        2. Companies working in this area include: Google, OpenAI, Anthropic, Adept, Imbue, Cohere, Hugging Face, Stability AI, Mistral, Contextual, DeepInfra
      2. Serving & Computing
        1. Pricing
          1. GPT-3.5 costs ~$0.01-0.03 per query vs. ~$3 for GPT-4
          2. Self-hosting is even more costly - an Nvidia A100 costs $4-40 per hour to run on GCP
        2. Latency
          1. batch queries, cache results, optimize memory to reduce fragmentation, speculative decoding
        3. Companies working in this area include: Modular, Lambda Labs, Union, Exafunction, Modal, HippoML, Banana, SkyPilot, Texel, Paperspace, Foundry, Goose
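One of the latency levers listed above, caching results, can be sketched with a memoized wrapper around the model call. `fake_llm` below is a placeholder for a real model endpoint, included only so the example runs:

```python
import functools

# Track how often the "model" actually runs, to show the cache working.
CALL_COUNT = {"n": 0}

def fake_llm(prompt):
    """Stand-in for a real model endpoint (assumption for illustration)."""
    CALL_COUNT["n"] += 1
    return prompt.upper()

@functools.lru_cache(maxsize=1024)
def cached_completion(prompt):
    # Identical prompts hit the in-memory cache instead of the model.
    return fake_llm(prompt)
```

Real serving stacks cache at several levels (exact-match, semantic-similarity, and KV-cache reuse inside the model), but the exact-match case above already eliminates repeated work for identical requests.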
      3. Model routing / abstraction
        1. an abstraction layer that decides which model should fulfill the user request
        2. Companies working in this area include: Martian, LiteLLM, NotDiamond
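A hypothetical rule-based router along these lines: cheap model for short or simple requests, stronger model otherwise. The model names and heuristics are placeholders; production routers often use a learned classifier instead of hand-written rules:

```python
# Illustrative router: the model names and "hardness" markers below are
# assumptions for the sketch, not real API identifiers.
def route(prompt, cheap="small-model", strong="large-model"):
    """Pick a model name based on crude prompt-complexity heuristics."""
    hard_markers = ("prove", "derive", "step by step")
    if len(prompt) > 500 or any(m in prompt.lower() for m in hard_markers):
        return strong
    return cheap
```

The appeal of a routing layer is cost: if most traffic is simple, most queries can be served at the cheap model's price without degrading quality on hard ones.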
      4. Fine-tuning & Optimization
        1. Fine-tuning is an evolving process with 2 components:
          1. Data ops / curation: improve data quality through modification or augmentation (e.g., with synthetic data)
          2. Fine-tuning ops: orchestrate the fine-tuning runs through which model weights are updated
        2. Companies working in this area include: Manual labeling: Scale, Appen, Hive, Labelbox, Surge; Programmatic labeling: Snorkel, Watchful, Lilac; Fine-tuning ops/optimization: Arcee.ai, Lamini, Predibase, Together, Watchful, Superintel, Thirdai, GenJet, Glaive, LMFlow, Nolano
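For illustration, a supervised fine-tuning example is typically a prompt/completion pair serialized as one JSON object per line (JSONL). The exact schema varies by provider, so the field names below are an assumption:

```python
import json

# Illustrative shape of one fine-tuning record; "prompt"/"completion"
# are assumed field names - check your provider's schema.
def make_record(prompt, completion):
    """Serialize one training example as a JSONL line."""
    return json.dumps({"prompt": prompt, "completion": completion})
```

A training file is then just one such line per curated example, which is where the data ops / curation work above pays off: the model only gets as good as these pairs.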
  3. Deployment layer
    1. Security & Governance
      1. at the data layer, restrict the model to data the user is authorized to access
      2. at the model layer, a supervisor/firewall can try to identify and block malicious prompts and outputs (but only with limited effectiveness)
      3. Companies working in this area include: Hiddenlayer, Lakera, Preamble, Vera, Credal, Fortify, Guardrail, Harmonic, Cadea, Laiyer
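A minimal sketch of the model-layer firewall idea: block prompts matching known injection patterns. The pattern list is an assumption for illustration, and, as noted above, pattern matching alone is only partially effective:

```python
import re

# Illustrative blocklist - real guardrail products combine patterns with
# classifiers; these two regexes are assumptions for the sketch.
BLOCKLIST = [
    re.compile(r"ignore (all )?previous instructions", re.I),
    re.compile(r"reveal .*system prompt", re.I),
]

def is_allowed(prompt):
    """Return False if the prompt matches a known injection pattern."""
    return not any(p.search(prompt) for p in BLOCKLIST)
```

The weakness is obvious from the code: any phrasing not on the blocklist passes, which is why the text above hedges on effectiveness.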
    2. Observability & Evaluation
      1. Monitor for performance degradation, system outages, and security breaches
      2. Companies working in this area include: Arthur, Arize, Fiddler, Latticeflow, Ventrilo, Gentrace, Katanemo, Helicone, Langfuse, Uptrain, Honeyhive, Whylabs
    3. Product analytics
      1. analyze whether the model fulfilled the user request, along with user behavior
      2. Companies working in this area include: Context.ai, Freeplay, Langfuse
    4. Orchestration & LLMOps
      1. Coordinate multi-step processes (e.g., API calls, state management)
      2. Companies working in this area include: OctoML, Weights & Biases, Langchain, LlamaIndex, Comet, Replicate, MindsDB, Ikigai Labs, MosaicML, Qwak, Outerbounds, Continual, Fixie, Griptape, Graft, BentoML, Steamship, Vellum, Humanloop, Patterns, Mendable, Konko AI, Stack, Dify, Nomos, Log10, Relevance AI, Klu, GradientJ, Pyqai, Autoblocks, Superintel, Pullflow; No-code: Swai AI, Lastmile.ai, Retune, Respell; Enterprise/GUI-based: Glean, Yurts, Dust
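The multi-step coordination described above can be sketched as a pipeline that threads state through ordered step functions. The step names and state keys here are illustrative, not any particular framework's API:

```python
# Minimal orchestration sketch: each step is a plain function that takes
# the shared state dict and returns it updated.
def run_pipeline(state, steps):
    for step in steps:
        state = step(state)
    return state

def retrieve(state):
    """Hypothetical retrieval step: attach context for the query."""
    state["context"] = f"docs about {state['query']}"
    return state

def build_prompt(state):
    """Assemble the final prompt from retrieved context + query."""
    state["prompt"] = f"{state['context']}\nQ: {state['query']}"
    return state
```

Orchestration frameworks add what this sketch omits: retries, branching, tracing of each step, and persistence of intermediate state.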
  4. Interface layer
    1. 3rd party API interoperability
      1. Companies working in this area include: Anon, Recall.ai, Bruinen, Induced AI, Reworkd
    2. User interface - what's beyond chat?
      1. Companies working in this area include: xPrompt, Inkeep