Inference vs RAG
RAG (Retrieval-Augmented Generation) is about what information the model has access to when generating a response. Instead of relying solely on knowledge baked into the model's weights during training, ==RAG systems fetch relevant documents or data at query time and inject them into the context window.== The model then generates a response grounded in that retrieved content. It's essentially a lookup + generate pipeline.
- RAG happens at runtime, not during training
- Retrieved content is explicit and inspectable (as much as the context can be inspected)
- RAG is useful for keeping responses current and for referencing specific sources
- The model doesn't learn anything; it just uses the additional content in its context to generate a response
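The lookup + generate flow described above can be sketched in a few lines of Python. This is a toy illustration, not a real RAG stack: the `Document` class, the keyword-overlap `retrieve` function, and the prompt template in `build_prompt` are all hypothetical names made up for this example. Production systems typically use embedding-based search, and the final prompt would be sent to an actual model.

```python
from dataclasses import dataclass


@dataclass
class Document:
    source: str  # where the text came from, so the answer can cite it
    text: str


def tokenize(text: str) -> set[str]:
    # Lowercase words with surrounding punctuation stripped.
    return {word.strip(".,?!").lower() for word in text.split()}


def retrieve(query: str, corpus: list[Document], k: int = 2) -> list[Document]:
    # Assumption: rank by shared-word count. Real retrievers usually
    # compare embeddings instead of raw keyword overlap.
    query_terms = tokenize(query)
    ranked = sorted(
        corpus,
        key=lambda doc: len(query_terms & tokenize(doc.text)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, retrieved: list[Document]) -> str:
    # The retrieved text is injected into the context window as plain
    # text, which is why it stays explicit and inspectable.
    context = "\n".join(f"[{doc.source}] {doc.text}" for doc in retrieved)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


corpus = [
    Document("handbook.md", "Refunds are processed within 14 days of the request."),
    Document("faq.md", "Support is available on weekdays from 9 to 5."),
    Document("changelog.md", "Version 2.0 added dark mode."),
]

query = "How long do refunds take?"
retrieved = retrieve(query, corpus, k=1)
prompt = build_prompt(query, retrieved)
print(prompt)  # this string is what the model would actually see
```

Nothing here touches a model's weights. The only thing that changes from query to query is the prompt string, and printing it shows exactly which source the answer would be grounded in.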
==Inference is the process of running a trained model (sometimes using RAG) to produce outputs.== RAG happens during inference, so the two are not alternatives: RAG is an optional step within it.
[!warning] A common point of confusion
People sometimes contrast RAG with "fine-tuning" (where you continue training the model on domain-specific data so it knows something intrinsically). RAG doesn't change what the model knows. It just changes what the model sees when generating a response. Fine-tuning changes the weights of the trained model. RAG changes the context at inference time.