Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreibk7htk2kagx6svn2mibbhxxtl57z2in7sss4gss4swqvungpss7y",
    "uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3mp6uwxzelpa2"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreihc5dzfohccaskvbwuaqz2oxqlq624rzckrq7ids6qml6tp3p2e7q"
    },
    "mimeType": "image/webp",
    "size": 84200
  },
  "path": "/derrickryangiggs/i-built-a-hybrid-search-engine-from-scratch-heres-what-i-learned-llm-zoomcamp-2026-module-2-3jdj",
  "publishedAt": "2026-06-26T11:49:10.000Z",
  "site": "https://dev.to",
  "tags": [
    "rag",
    "llm",
    "vectordatabase",
    "datatalksclub",
    "github.com/Derrick-Ryan-Giggs/llm-zoomcamp-2026",
    "github.com/DataTalksClub/llm-zoomcamp"
  ],
  "textContent": "I just completed Module 2 of the **LLM Zoomcamp 2026** by @DataTalksClub — and this module completely changed how I think about search.\n\nModule 1 taught me RAG and agentic pipelines. Module 2 taught me that the search step inside RAG matters far more than I realized — and that keyword search is only half the story.\n\nHere's everything I built and learned.\n\n##  What Is Vector Search and Why Does It Matter?\n\nTraditional keyword search matches words. If you search for \"enroll\", it finds documents containing \"enroll\" — but misses documents about \"joining\", \"signing up\", or \"registration\" even if they mean exactly the same thing.\n\n**Vector search matches meaning, not words.**\n\nEvery piece of text gets converted into a vector — a list of hundreds of numbers that captures its semantic meaning. Similar meanings produce similar vectors, so you can find relevant documents even when they use completely different words.\n\nThis is the foundation of modern AI-powered search, and it's what makes RAG systems actually work at scale.\n\n##  What I Built in Module 2\n\n###  1. Text Embeddings with a Lightweight ONNX Model\n\nInstead of downloading the full PyTorch + CUDA stack (~2GB), I used a lightweight ONNX runtime embedder — same vectors, 30x smaller installation, runs on any CPU:\n\n\n\n    from embedder import Embedder\n\n    embedder = Embedder()  # loads Xenova/all-MiniLM-L6-v2 via ONNX\n    v = embedder.encode(\"How does approximate nearest neighbor search work?\")\n    print(len(v))  # 384 dimensions\n\n\nThe model produces **384-dimensional vectors** — each number represents a dimension of meaning in the text.\n\n###  2. 
Vector Search From Scratch with NumPy\n\nBefore using any library, I implemented vector search by hand to understand what's happening under the hood:\n\n\n\n    import numpy as np\n\n    # cosine similarity — vectors are normalized, so dot product works directly\n    def cosine_similarity(a, b):\n        return np.dot(a, b)\n\n    # score all chunks against a query\n    scores = X.dot(v)  # X is the matrix of all chunk embeddings\n    best_idx = np.argmax(scores)\n\n\nVector databases like Qdrant and pgvector compute the same similarity scores, but instead of scanning every vector they typically use an approximate nearest-neighbor index such as HNSW, which is much faster at scale.\n\n###  3. Chunking Long Documents for Better Retrieval\n\nFull pages are too long and dilute the embedding — a match buried deep in a 10,000-character page still pulls in the whole page. The fix is chunking:\n\n\n\n    from gitsource import chunk_documents\n    chunks = chunk_documents(documents, size=2000, step=1000)\n    # 72 pages → 295 overlapping chunks\n\n\nOverlapping chunks (step < size) mean a sentence cut at one chunk's boundary still appears whole in the neighboring chunk. After chunking, retrieval becomes far more precise.\n\n###  4. Vector Search with minsearch\n\n`minsearch` now has a `VectorSearch` class that wraps the NumPy math into a clean interface:\n\n\n\n    from minsearch import VectorSearch\n\n    vector_index = VectorSearch(keyword_fields=[\"filename\"])\n    vector_index.fit(X, chunks)\n\n    results = vector_index.search(query_vector, num_results=5)\n\n\n###  5. Comparing Keyword vs Vector Search\n\nFor the query **\"How do I store vectors in PostgreSQL?\"**:\n\n  * **Keyword search** — missed `08-pgvector.md` entirely because \"pgvector\" wasn't in the query\n  * **Vector search** — ranked `08-pgvector.md` first because it understood the semantic connection between \"store vectors\" and \"pgvector\"\n\n\n\nThis is the key insight: vector search finds meaning, keyword search finds words.\n\n###  6. 
Hybrid Search with Reciprocal Rank Fusion (RRF)\n\nNeither approach is perfect on its own:\n\n  * Vector search can miss exact terms, names, and rare keywords\n  * Keyword search misses paraphrases and synonyms\n\n\n\nThe solution is **hybrid search** — run both and merge the results using RRF:\n\n\n\n    def rrf(result_lists, k=60, num_results=5):\n        scores = {}\n        docs = {}\n        for results in result_lists:\n            for rank, doc in enumerate(results):\n                key = (doc[\"filename\"], doc[\"start\"])\n                scores[key] = scores.get(key, 0) + 1 / (k + rank)\n                docs[key] = doc\n        ranked = sorted(scores, key=scores.get, reverse=True)\n        return [docs[key] for key in ranked[:num_results]]\n\n    results = rrf([vector_results, text_results])\n\n\nRRF ignores raw scores (which live on different scales) and only looks at rank position. A document that ranks well in both lists beats one that's only strong in a single list — even if it wasn't first in either.\n\n##  Key Takeaways\n\n**1. Embeddings capture meaning, not words.** \"Enroll\" and \"join\" produce similar vectors. \"Pizza\" and \"enrollment\" don't. This is what makes semantic search powerful.\n\n**2. Chunking is not optional.** Full pages dilute embeddings. 2,000-character overlapping chunks dramatically improve retrieval precision and cut LLM input tokens by 3x.\n\n**3. Neither keyword nor vector search is best on its own.** Use hybrid search (RRF) in production. It often outperforms either approach alone.\n\n**4. ONNX makes embeddings practical anywhere.** No GPU, no PyTorch, no CUDA. 67MB download, runs on a basic laptop. There's no reason not to use vector search even in constrained environments.\n\n**5. The right search approach depends on your data.** Vector search wins for semantic queries. Keyword search wins for exact terms (names, codes, IDs). 
Hybrid wins most of the time — but measure to be sure.\n\n##  My Homework Solution\n\nAll my code for Module 2 is open source:\n\n**github.com/Derrick-Ryan-Giggs/llm-zoomcamp-2026**\n\nIt includes:\n\n  * `vector-search.ipynb` — embeddings, Qdrant, and vector RAG pipeline\n  * `Vector Search Homework.ipynb`\n\n\n\n##  Want to Learn Too?\n\nLLM Zoomcamp is **completely free** — no paywall, no certificate fees.\n\nSign up: github.com/DataTalksClub/llm-zoomcamp\n\n_Are you working through LLM Zoomcamp 2026? Drop a comment — I'd love to compare notes._",
  "title": "I Built a Hybrid Search Engine From Scratch — Here's What I Learned (LLM Zoomcamp 2026, Module 2)"
}