{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreihcqt4ctvzoafoccoxjuitjuazt5izxpxcc3jt6nrjilbrsxpv7b4",
    "uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3mpomeuiacy72"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreib2pw2lpyry72bj4275unaxyv63qkthxbfzafxybhgsaoiw44bqzi"
    },
    "mimeType": "image/webp",
    "size": 121342
  },
  "path": "/sumanpro/practical-rag-part-1-the-simplest-rag-that-actually-works-4hm1",
  "publishedAt": "2026-07-02T17:28:58.000Z",
  "site": "https://dev.to",
  "tags": [
    "rag",
    "python",
    "llm",
    "ai",
    "https://www.kaggle.com/code/sumannath88/ep01-simple-rag",
    "OpenRouter"
  ],
  "textContent": "_By Suman — Part 1 of the **Practical RAG_ * series. All code is in a runnable notebook: https://www.kaggle.com/code/sumannath88/ep01-simple-rag\n\nEveryone talks about RAG. Far fewer people have built the _simplest_ version end to end and looked at exactly where it falls over.\n\nThat's what this series does. We start with the most naive RAG pipeline that actually works, understand it completely, and then — one concrete problem at a time — make it better. No frameworks hiding the moving parts. Just Python you can read.\n\nBy the end of this post you'll have a working pipeline in about 40 lines that answers questions correctly — and you'll understand exactly why that success is misleading. Those hidden weaknesses are the roadmap for the rest of the series.\n\n##  What RAG actually is\n\nRAG — Retrieval-Augmented Generation — is one idea: **before you ask the model a question, go find relevant text and paste it into the prompt.** That's it. The \"retrieval\" finds the text; the \"generation\" is the LLM answering with that text in front of it.\n\nWhy bother? Because it lets a model answer questions about _your_ data — documents it was never trained on — without fine-tuning, and it grounds answers in real sources instead of the model's memory.\n\nThe naive pipeline has five steps:\n\n  1. **Load** your documents\n  2. **Chunk** them into pieces\n  3. **Embed** each chunk into a vector\n  4. **Retrieve** the chunks most similar to the question\n  5. 
**Generate** an answer with those chunks as context\n\n\n\nLet's build each one.\n\n##  Setup\n\nWe'll use local embeddings (via `sentence-transformers`) so retrieval is free and needs no API key, and OpenRouter for generation because it exposes an OpenAI-compatible API across many models.\n\n\n\n    pip install sentence-transformers openai numpy\n\n\n\n    import os\n    import numpy as np\n\n    # On Kaggle, store OPENROUTER_API_KEY as a notebook Secret; elsewhere use an\n    # env var or paste it inline.\n    try:\n        from kaggle_secrets import UserSecretsClient\n        os.environ.setdefault(\n            \"OPENROUTER_API_KEY\",\n            UserSecretsClient().get_secret(\"OPENROUTER_API_KEY\"),\n        )\n    except ModuleNotFoundError:\n        os.environ.setdefault(\"OPENROUTER_API_KEY\", \"sk-or-...\")  # your key\n\n    LLM_MODEL   = \"deepseek/deepseek-v4-flash\"\n    EMBED_MODEL = \"sentence-transformers/all-MiniLM-L6-v2\"\n    TOP_K = 3\n\n\n> The notebook runs on Kaggle, Colab, or locally. Embeddings are computed locally, so only generation touches the network.\n\n##  1 & 2. Load and chunk\n\nTo keep everything self-contained, our \"corpus\" is a handful of short passages about planets. And our chunking strategy is the simplest one imaginable: **one chunk per document.**\n\n\n\n    DOCUMENTS = [\n        \"Mercury is the smallest planet ... no moons ...\",\n        \"Venus is the hottest planet ... 465 degrees Celsius.\",\n        \"Earth ... the only known world with liquid water and life ...\",\n        \"Mars ... two small moons, Phobos and Deimos.\",\n        \"Jupiter is the largest planet ... at least 95 known moons.\",\n        \"Saturn ... famous for its prominent ring system ...\",\n    ]\n    chunks = DOCUMENTS  # naive: each doc is one chunk\n\n\nThis is fine because the passages are already short. Hold onto that caveat — it's the first thing that breaks on real data.\n\n##  3. 
Embed\n\nAn embedding turns text into a vector of numbers such that similar meanings land near each other in space. We compute one vector per chunk, once, up front.\n\n\n\n    from sentence_transformers import SentenceTransformer\n\n    embedder = SentenceTransformer(EMBED_MODEL)\n    chunk_embeddings = embedder.encode(chunks, normalize_embeddings=True)\n\n\nWe normalize the vectors so that cosine similarity — the standard measure of \"how close are these two meanings\" — collapses to a plain dot product.\n\n##  4. Retrieve\n\nTo answer a question, embed the question the same way, score it against every chunk, and keep the top _k_.\n\n\n\n    def retrieve(question, k=TOP_K):\n        q_emb = embedder.encode([question], normalize_embeddings=True)[0]\n        scores = chunk_embeddings @ q_emb        # cosine similarity\n        top_idx = np.argsort(scores)[::-1][:k]\n        return [(chunks[i], float(scores[i])) for i in top_idx]\n\n\nAsk _\"Which planet has the most moons?\"_ and the Jupiter chunk comes back on top. No LLM involved yet — this is pure vector search.\n\n##  5. Generate\n\nNow stitch the retrieved chunks into a prompt and ask the model — instructing it to answer **only** from the provided context. That instruction is the heart of RAG discipline: it's what keeps the model grounded instead of guessing.\n\n\n\n    from openai import OpenAI\n\n    client = OpenAI(base_url=\"https://openrouter.ai/api/v1\",\n                    api_key=os.environ[\"OPENROUTER_API_KEY\"])\n\n    def answer(question, k=TOP_K):\n        retrieved = retrieve(question, k)\n        context = \"\\n\\n\".join(f\"[{i+1}] {c}\" for i, (c, _) in enumerate(retrieved))\n        prompt = (\n            \"Answer the question using ONLY the context below. 
\"\n            \"If the answer is not in the context, say you don't know.\\n\\n\"\n            f\"Context:\\n{context}\\n\\nQuestion: {question}\\nAnswer:\"\n        )\n        resp = client.chat.completions.create(\n            model=LLM_MODEL,\n            messages=[{\"role\": \"user\", \"content\": prompt}],\n            temperature=0,\n        )\n        return resp.choices[0].message.content, retrieved\n\n\n\n    answer(\"Which planet has the most moons?\")[0]\n    # -> \"Jupiter, with at least 95 known moons.\"\n\n\nThat's a complete RAG system. Load → chunk → embed → retrieve → generate.\n\n##  It works — and that's the trap\n\nHere's the twist: this pipeline handles the hard-looking questions just fine.\n\n**A question outside the corpus:**\n\n\n\n    answer(\"How far is Pluto from the Sun?\")[0]\n    # -> \"I don't know.\"\n\n\nPluto isn't in our documents, and the model correctly refuses to invent an answer. Grounding is doing its job.\n\n**A comparison spanning two chunks:**\n\n\n\n    answer(\"Which is hotter, Venus or Mercury, and why?\")[0]\n    # -> \"Venus is hotter (~465°C) because its thick CO2 atmosphere traps heat,\n    #     while Mercury has almost no atmosphere.\"\n\n\nThe answer lives across two chunks, and top-_k_ retrieval pulls both. Correct, and even well-reasoned.\n\nSo naive RAG _works_. It works flawlessly. And that is exactly the problem — because it's working on six clean, short, hand-picked paragraphs. A small, tidy corpus hides every weakness the technique has.\n\n##  The weaknesses hiding behind the demo — and the roadmap\n\nClean answers on toy data prove almost nothing. 
Each of these breaks the moment you point naive RAG at real documents, and each is exactly what a later part of the series fixes:\n\n  * **Chunking is naive.** One-chunk-per-document collapses when documents are long — the right passage gets buried in noise or split apart.\n  * **Retrieval is purely semantic.** Exact keywords — names, IDs, error codes — can slip past vector similarity. Hybrid (keyword + vector) search helps.\n  * **No reranking.** With hundreds of chunks, the top _k_ by cosine similarity aren't reliably the most _useful_ k.\n  * **No evaluation.** We're eyeballing two answers. Without numbers, we can't tell whether any \"improvement\" actually improved anything.\n\n\n\n**Part 2** takes on chunking and retrieval quality — and adds a small evaluation harness so every change from here on is measurable.\n\nThe full runnable notebook for this part is here: https://www.kaggle.com/code/sumannath88/ep01-simple-rag\n\nIf this was useful, follow along — the series gets more interesting as the naive version starts to hurt.\n\n_Next: Part 2 — Better chunks, hybrid retrieval, and how to actually measure RAG._",
  "title": "Practical RAG, Part 1: The Simplest RAG That Actually Works"
}