{
  "$type": "site.standard.document",
  "canonicalUrl": "https://jacob.blog/notes/inference-vs-rag",
  "description": "Defining two terms that often get conflated",
  "path": "/notes/inference-vs-rag",
  "publishedAt": "2026-06-17T14:46:11.393Z",
  "site": "at://did:plc:ckthoyuvsmkp254fyuinyzb2/site.standard.publication/3mndm6tiamb26",
  "tags": [
    "ai"
  ],
  "textContent": "RAG (Retrieval-Augmented Generation) is about what information the model has access to when generating a response. Instead of relying solely on knowledge baked into the model's weights during training, ==RAG systems fetch relevant documents or data at query time and inject them into the context window.== The model then generates a response grounded in that retrieved content. It's essentially a lookup-then-generate pipeline.\n\n- RAG happens at _runtime_, not during training\n- Retrieved content is explicit and inspectable (to the extent that the context itself can be inspected)\n- RAG is useful for keeping responses _current_ and for referencing _specific sources_\n- The model doesn't learn anything; it just uses additional content in its context to generate a response\n\n==Inference is the process of running a trained model (sometimes using RAG) to produce outputs.== RAG happens _during_ inference.\n\n> [!warning] A common point of confusion\n>\n> People sometimes contrast RAG with \"fine-tuning\" (where you continue training the model on domain-specific data so it knows something intrinsically). RAG doesn't change what the model knows. It just changes what the model sees when generating a response. Fine-tuning changes the weights of the trained model. RAG changes the context at inference time.\n\n<https://www.anthropic.com/engineering/contextual-retrieval>",
  "title": "Inference vs RAG"
}