Jacob Bennett

Inference vs RAG

Jacob Bennett June 17, 2026

RAG (Retrieval-Augmented Generation) is about what information the model has access to when generating a response. Instead of relying solely on knowledge baked into the model's weights during training, ==RAG systems fetch relevant documents or data at query time and inject them into the context window.== The model then generates a response grounded in that retrieved content. It's essentially a lookup + generate pipeline.

- RAG happens at runtime, not during training.
- Retrieved content is explicit and inspectable (as much as the context can be inspected).
- RAG is useful for keeping responses current and for referencing specific sources.
- The model doesn't learn anything; it just uses additional content in its context to generate a response.
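
The lookup + generate pipeline described above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the `DOCUMENTS` list, the word-overlap scoring, and the function names `retrieve` and `build_prompt` are all assumptions made for the example. A real system would typically use embeddings and a vector store, and the final prompt would be sent to a model rather than printed.

```python
import re
from collections import Counter

# Hypothetical in-memory knowledge base (an assumption for this sketch).
DOCUMENTS = [
    "The office is closed on public holidays.",
    "Support tickets are answered within two business days.",
    "Refunds are processed within five business days of approval.",
]

def tokenize(text):
    """Lowercase word counts; a crude stand-in for an embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, documents, k=2):
    """Lookup step: return up to k documents that share words with the query."""
    q = tokenize(query)
    # Score each document by the number of word occurrences shared with the query.
    scored = [(sum((q & tokenize(d)).values()), d) for d in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for score, d in scored[:k] if score > 0]

def build_prompt(query, retrieved):
    """Inject step: place the retrieved text into the context the model will see."""
    context = "\n".join(f"- {d}" for d in retrieved)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

query = "How long do refunds take?"
retrieved = retrieve(query, DOCUMENTS)
prompt = build_prompt(query, retrieved)

# The generate step would pass `prompt` to a trained model at inference time.
print(prompt)
```

Everything the model is given is visible in `prompt`, which is what makes the retrieved content inspectable. No training takes place at any point.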

==Inference is the process of running a trained model (sometimes using RAG) to produce outputs.== RAG happens during inference.

[!warning] A common point of confusion

People sometimes contrast RAG with "fine-tuning" (where you retrain the model on domain-specific data so it knows something intrinsically). RAG doesn't change what the model knows. It just changes what the model sees when generating a response. Fine-tuning changes the weights of the trained model. RAG changes the context at inference time.
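
The distinction can be made concrete with a deliberately simplified toy. The sketch below assumes a "model" whose weights are just a dictionary of known answers. Real models do not store knowledge this way, and `generate` and `fine_tune` are hypothetical names for illustration only. The point is which object changes in each case.

```python
def generate(weights, question, context=None):
    """Toy inference: prefer what is in the context, else fall back to the weights."""
    context = context or {}
    if question in context:
        return context[question]
    return weights.get(question, "I don't know")

def fine_tune(weights, new_data):
    """Toy fine-tuning: produce new weights that include the new data."""
    return {**weights, **new_data}

base_weights = {"capital of France": "Paris"}
new_fact = {"password rotation policy": "rotated monthly"}
question = "password rotation policy"

# RAG: the weights are untouched; only the context passed at inference time changes.
rag_answer = generate(base_weights, question, context=new_fact)

# Without the retrieved context, the same weights cannot answer.
no_context_answer = generate(base_weights, question)

# Fine-tuning: the weights themselves change, so no context is needed.
tuned_weights = fine_tune(base_weights, new_fact)
tuned_answer = generate(tuned_weights, question)

print(rag_answer)         # rotated monthly
print(no_context_answer)  # I don't know
print(tuned_answer)       # rotated monthly
```

After the RAG call, `base_weights` is identical to what it was before. Only the fine-tuned copy carries the new fact.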

https://www.anthropic.com/engineering/contextual-retrieval
