Inference vs RAG

Jacob Bennett June 17, 2026
Source

RAG (Retrieval-Augmented Generation) is about what information the model has access to when generating a response. Instead of relying solely on knowledge baked into the model's weights during training, ==RAG systems fetch relevant documents or data at query time and inject them into the context window.== The model then generates a response grounded in that retrieved content. It's essentially a lookup + generate pipeline.

  • RAG happens at runtime, not during training
  • Retrieved content is explicit and inspectable, to the extent that you can inspect what goes into the context
  • RAG is useful for keeping responses current and for referencing specific sources
  • The model doesn't learn anything; it just uses the additional content in its context to generate a response
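
To make the lookup + generate idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `DOCUMENTS` corpus is made up, the word-overlap `score` function stands in for a real search index or embedding lookup, and the final prompt would be handed to whatever model you actually use. The sketch only covers the retrieval and prompt assembly steps.

```python
import re
from collections import Counter
from typing import List

# Hypothetical in-memory corpus; a real system would query a search index or vector store.
DOCUMENTS = [
    "The billing API rate limit was raised to 500 requests per minute in March.",
    "Password resets require a verified email address.",
    "Invoices are generated on the first day of each month.",
]


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query: str, doc: str) -> int:
    """Toy relevance score: how many query tokens also appear in the document."""
    query_counts = Counter(tokenize(query))
    doc_tokens = set(tokenize(doc))
    return sum(count for tok, count in query_counts.items() if tok in doc_tokens)


def retrieve(query: str, docs: List[str], k: int = 2) -> List[str]:
    """Return the k highest-scoring documents for the query."""
    ranked = sorted(docs, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:k]


def build_prompt(query: str, passages: List[str]) -> str:
    """Inject the retrieved passages into the text the model will see."""
    sources = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the sources below.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {query}"
    )


query = "What is the billing API rate limit?"
passages = retrieve(query, DOCUMENTS, k=1)
prompt = build_prompt(query, passages)
print(prompt)
```

Note that nothing here touches a model. The retrieved passage ends up as plain text in the prompt, which is why it can be logged and inspected before generation happens.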

==Inference is the process of running a trained model (sometimes using RAG) to produce outputs.== RAG happens during inference.

[!warning] A common point of confusion

People sometimes contrast RAG with "fine-tuning" (where you retrain the model on domain-specific data so it knows something intrinsically). RAG doesn't change what the model knows. It just changes what the model sees when generating a response. Fine-tuning changes the weights of the trained model. RAG changes the context at inference time.
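
The difference can be shown with a toy analogy in Python. This is not how a real model works: `ToyModel`, its dictionary of "weights", and the `fine_tune` helper are all hypothetical stand-ins, and real fine-tuning updates numeric parameters through gradient descent rather than adding dictionary entries. The sketch only illustrates where the new information lives in each case.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ToyModel:
    # Stand-in for trained weights: facts the "model" knows intrinsically.
    weights: Dict[str, str] = field(default_factory=dict)

    def generate(self, question: str, context: Optional[Dict[str, str]] = None) -> str:
        # Context supplied at inference time is consulted first, then the weights.
        if context and question in context:
            return context[question]
        return self.weights.get(question, "I don't know")


def fine_tune(model: ToyModel, examples: Dict[str, str]) -> ToyModel:
    """Hypothetical fine-tuning: return a new model whose weights include the examples."""
    return ToyModel(weights={**model.weights, **examples})


base = ToyModel(weights={"capital of France": "Paris"})
new_fact = {"office wifi name": "atrium-5g"}  # made-up domain-specific data

# RAG: the weights stay the same, the fact is only present in the context.
rag_answer = base.generate("office wifi name", context=new_fact)
after_rag = base.generate("office wifi name")  # no context this time

# Fine-tuning: the weights themselves change.
tuned = fine_tune(base, new_fact)
tuned_answer = tuned.generate("office wifi name")

print(rag_answer, "|", after_rag, "|", tuned_answer)
```

With the context supplied, the base model can answer; without it, the same model cannot, because its weights never changed. The fine-tuned copy answers without any context at all.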

https://www.anthropic.com/engineering/contextual-retrieval
