{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreicgelbgfqi2hwarp4222ettxs4ecwlm4wagqf4yumlwdnkxjbtvqy",
"uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3mom7jnejjdc2"
},
"coverImage": {
"$type": "blob",
"ref": {
"$link": "bafkreia2uqkqctd7xjycvp5rjjln3s5wlaw2eeelf4aucys6fkbbpkev6e"
},
"mimeType": "image/webp",
"size": 477114
},
"path": "/shayan_holakouee/vector-databases-are-not-magic-heres-whats-actually-happening-under-the-hood-566c",
"publishedAt": "2026-06-19T01:28:13.000Z",
"site": "https://dev.to",
"tags": [
"ai",
"vectordatabase",
"python",
"machinelearning",
"dev.to",
"@dataclass"
],
"textContent": "You've seen the tutorials. Spin up Pinecone, call `.upsert()`, do a similarity search, ship it. Everyone claps. The demo works.\n\nThen you take it to production and it starts lying to you.\n\nResults that look semantically relevant but aren't. Queries that should match something and return nothing. Latency that makes your users think the app crashed. And the worst part - you don't know why, because the vector database feels like a black box with a fancy API.\n\nThis article is about opening that box.\n\n## What a Vector Database Actually Is\n\nLet's be honest about what \"vector database\" means, because the term is doing a lot of marketing work right now.\n\nAt its core, a vector database is an index optimized for **approximate nearest neighbor (ANN) search** over high-dimensional float arrays. That's it. The \"database\" part - persistence, CRUD, filtering, transactions - is infrastructure wrapped around that core capability.\n\nWhen you store an embedding, you're storing a point in N-dimensional space (typically 768, 1536, or 3072 dimensions depending on your model). When you query, you're asking: _\"which stored points are closest to this query point, by some distance metric?\"_\n\nThe challenge? Doing exact nearest neighbor search at scale is `O(N * D)` - linear in your corpus size times the dimensionality. For a million 1536-dim vectors, that's ~1.5 billion multiply-adds per query, scanning roughly 6 GB of float32 data. 
At millisecond latency requirements, that's a hard no.\n\n**ANN algorithms trade a small amount of accuracy for massive speed gains.** Understanding this trade-off is the first thing most tutorials skip - and it's where production bugs hide.\n\n## The Index Is the Product\n\nThe algorithm your vector DB uses to build its index determines everything: speed, recall, memory usage, and how it degrades under pressure.\n\n### HNSW (Hierarchical Navigable Small World)\n\nThis is what most modern vector DBs use by default or offer as a first-class option (Qdrant, Weaviate, Milvus, pgvector since 0.5.0). HNSW builds a **multi-layer graph** where:\n\n * The top layer is sparse - only a few highly-connected \"hub\" nodes\n * Each lower layer gets progressively denser\n * Querying starts at the top and greedily navigates down toward the nearest neighbor\n\n\n\nThink of it like a highway system. You jump on the highway (top layer), drive toward your destination, exit at the right interchange, and then use local streets (bottom layer) for precision.\n\n**Key parameters you need to know:**\n\n\n\n    # Qdrant example\n    # Assumes `client` is a connected QdrantClient and `query_embedding` is your query vector\n    from qdrant_client.models import Distance, HnswConfigDiff, SearchParams, VectorParams\n\n    client.create_collection(\n        collection_name=\"my_docs\",\n        vectors_config=VectorParams(\n            size=1536,\n            distance=Distance.COSINE,\n            hnsw_config=HnswConfigDiff(\n                m=16,  # Number of edges per node. Higher = better recall, more memory\n                ef_construct=100,  # Construction-time beam width. Higher = better index quality, slower build\n            ),\n        ),\n    )\n\n    # At query time\n    results = client.search(\n        collection_name=\"my_docs\",\n        query_vector=query_embedding,\n        limit=10,\n        search_params=SearchParams(hnsw_ef=128),  # Runtime beam width. Higher = better recall, slower query\n    )\n\n\n`m` and `ef_construct` are set at build time and can't change without rebuilding your index. If you're seeing poor recall in production and you set `m=4` to save memory, that's your culprit.\n\n### IVF (Inverted File Index)\n\nUsed by FAISS and as an option in pgvector. 
Divides the vector space into Voronoi cells (clusters), assigns vectors to their nearest centroid, then searches only a subset of cells at query time.\n\n\n\n # FAISS IVF example\n import faiss\n import numpy as np\n\n dimension = 1536\n n_clusters = 1024 # Number of Voronoi cells\n\n quantizer = faiss.IndexFlatL2(dimension)\n index = faiss.IndexIVFFlat(quantizer, dimension, n_clusters)\n\n # Must train before adding vectors\n index.train(training_vectors) # Needs representative data\n index.add(corpus_vectors)\n\n # nprobe = how many cells to search. More = better recall, slower\n index.nprobe = 32\n distances, indices = index.search(query_vector, k=10)\n\n\n**IVF gotcha:** the cluster centroids are learned during training. If your data distribution shifts significantly (new document types, different topics), your centroid structure becomes suboptimal and recall tanks. You don't get an error. You just quietly get worse results.\n\n## Distance Metrics: You're Probably Using the Wrong One\n\nMost people use cosine similarity because the tutorial said so. Here's when that's wrong.\n\nMetric | Formula | Use When\n---|---|---\nCosine | `1 - (A·B / ‖A‖‖B‖)` | Direction matters, magnitude doesn't. Good for normalized text embeddings\nDot Product | `-(A·B)` | Embeddings are already normalized (OpenAI's are). Faster than cosine\nEuclidean (L2) | `‖A-B‖` | Magnitude carries meaning. Image embeddings, some multimodal models\n\nOpenAI's `text-embedding-3-*` embeddings are normalized to unit length. Cosine similarity on unit vectors is mathematically equivalent to dot product. 
Using cosine adds a normalization step that's pure overhead.\n\n\n\n # If you're using OpenAI embeddings, use dot product\n # In Qdrant:\n VectorParams(size=1536, distance=Distance.DOT)\n\n # In pgvector:\n # Use <=> for cosine, <#> for negative inner product (dot), <-> for L2\n SELECT content, embedding <#> query_embedding AS score\n FROM documents\n ORDER BY score\n LIMIT 10;\n\n\nThe difference in latency is small at low scale. At 10M+ vectors, it's measurable.\n\n## The Recall Problem Nobody Talks About\n\nHere's a thing that will haunt you: **your ANN search does not always return the true nearest neighbors.**\n\nIt returns _approximate_ nearest neighbors. That's the A in ANN. By definition, you may miss results that should have ranked in your top-K.\n\nHow bad is it? It depends on your index config and your data. You can measure it:\n\n\n\n import numpy as np\n from qdrant_client import QdrantClient\n\n def measure_recall(client, collection_name, test_queries, ground_truth_ids, k=10):\n \"\"\"\n Compare ANN results against brute-force exact search.\n ground_truth_ids: list of lists, true top-k ids per query\n \"\"\"\n hits = 0\n total = len(test_queries) * k\n\n for query, true_ids in zip(test_queries, ground_truth_ids):\n ann_results = client.search(\n collection_name=collection_name,\n query_vector=query,\n limit=k\n )\n ann_ids = {r.id for r in ann_results}\n hits += len(ann_ids & set(true_ids))\n\n return hits / total # recall@k\n\n\n # A well-tuned index should hit 0.95+ recall@10\n # If you're at 0.85 or below, tune ef or m\n\n\nProduction target: **≥ 0.95 recall@10**. 
Anything below that and your RAG pipeline is silently missing relevant context before GPT-4 ever sees it.\n\n## Hybrid Search: The Architecture You Should Actually Be Using\n\nPure vector search has a well-known failure mode: **it doesn't handle rare terms well.**\n\nIf your corpus contains \"RFC 7807 Problem Details\" or a specific error code like `E_INVALIDARG_0x80070057`, embedding similarity will dilute the match across semantically adjacent concepts. A user querying for the exact string gets mushy results.\n\nThe solution is **hybrid search**: combine dense vector search with sparse BM25-style keyword search, then fuse the rankings.\n\n\n\n    from qdrant_client import QdrantClient\n    from qdrant_client.models import (\n        SparseVectorParams, VectorParams,\n        SparseIndexParams, Distance, SparseVector\n    )\n\n    # Qdrant supports both dense and sparse vectors natively\n    client.create_collection(\n        collection_name=\"hybrid_docs\",\n        vectors_config={\n            # size must match the dense model below (bge-small-en-v1.5 outputs 384 dims)\n            \"dense\": VectorParams(size=384, distance=Distance.COSINE),\n        },\n        sparse_vectors_config={\n            \"sparse\": SparseVectorParams(index=SparseIndexParams(on_disk=False))\n        }\n    )\n\n    # At insert time, generate both representations\n    from fastembed import SparseTextEmbedding, TextEmbedding\n\n    dense_model = TextEmbedding(\"BAAI/bge-small-en-v1.5\")\n    sparse_model = SparseTextEmbedding(\"prithivida/Splade_PP_en_v1\")\n\n    text = \"RFC 7807 Problem Details for HTTP APIs\"\n    dense_vec = list(dense_model.embed([text]))[0]\n    sparse_vec = list(sparse_model.embed([text]))[0]\n\n    # At query time, use Reciprocal Rank Fusion (RRF)\n    from qdrant_client.models import Prefetch, FusionQuery, Fusion\n\n    results = client.query_points(\n        collection_name=\"hybrid_docs\",\n        prefetch=[\n            Prefetch(query=dense_vec.tolist(), using=\"dense\", limit=20),\n            Prefetch(\n                query=SparseVector(indices=sparse_vec.indices.tolist(),\n                                   values=sparse_vec.values.tolist()),\n                using=\"sparse\", limit=20\n            ),\n        ],\n        
query=FusionQuery(fusion=Fusion.RRF),\n        limit=10\n    )\n\n\n**RRF (Reciprocal Rank Fusion)** combines the rank lists without needing score normalization. The formula is simple:\n\n\n\n    RRF_score(d) = Σ_i 1 / (k + rank_i(d))\n\n\nWhere `k` is a constant (usually 60), `rank_i(d)` is the document's rank in result list `i`, and the sum runs over every list the document appears in. Documents appearing in both lists get a significant boost.\n\nHybrid search consistently outperforms pure dense search on real-world corpora by **5–15% on NDCG@10** - especially for domain-specific or technical content.\n\n## Metadata Filtering: The Performance Trap\n\nVector DBs let you filter by metadata before (or after) the ANN search. This sounds simple. It's actually one of the most common performance footguns.\n\n**Pre-filtering** (filter before ANN): Apply your metadata filter first, reduce the candidate set, then run ANN on the smaller set.\n\nProblem: if your filter is very selective (e.g., `user_id = \"abc123\"` in a multi-tenant system), the candidate set might be tiny. HNSW graph navigation assumes a large, connected graph. A sparse subgraph destroys recall.\n\n**Post-filtering** (ANN then filter): Run ANN on the full corpus, retrieve top-N, then apply filter. 
You need to over-fetch significantly to compensate for filtered-out results.\n\n\n\n    # Qdrant handles this with \"indexed\" payload fields\n    # Always index fields you filter on\n    from qdrant_client.models import FieldCondition, Filter, MatchValue\n\n    client.create_payload_index(\n        collection_name=\"my_docs\",\n        field_name=\"tenant_id\",\n        field_schema=\"keyword\"  # or \"integer\", \"float\", \"geo\"\n    )\n\n    # Qdrant picks a filtering strategy automatically from cardinality estimates:\n    # If filter is selective → payload index lookup, then brute force on the filtered set\n    # If filter is broad → HNSW with the filter checked during graph traversal\n\n    results = client.search(\n        collection_name=\"my_docs\",\n        query_vector=query_embedding,\n        query_filter=Filter(\n            must=[FieldCondition(key=\"tenant_id\", match=MatchValue(value=\"abc123\"))]\n        ),\n        limit=10\n    )\n\n\n**Rule of thumb:** if your filter reduces the corpus below ~1000 vectors, you're effectively doing brute-force search. That's fine - just know it and set expectations accordingly.\n\n## The Chunking Strategy You Need to Revisit\n\nThis isn't vector DB internals, but it's so deeply related that skipping it would be malpractice.\n\nYour retrieval quality is bounded by your chunking quality. 
The vector DB can only return what you gave it.\n\nMost tutorials show:\n\n\n\n    # The naïve approach that everyone copies\n    from langchain_text_splitters import RecursiveCharacterTextSplitter\n\n    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)\n    chunks = text_splitter.split_text(document)\n\n\nThe problems:\n\n * Fixed-size chunks break semantic units arbitrarily\n * A sentence spanning a chunk boundary gets split into two orphaned halves\n * 500 characters (this splitter counts characters by default, not tokens) might be too large for precise retrieval, too small for necessary context\n\n\n\n**Better: semantic chunking**\n\n\n\n    from langchain_experimental.text_splitter import SemanticChunker\n    from langchain_openai import OpenAIEmbeddings\n\n    splitter = SemanticChunker(\n        OpenAIEmbeddings(),\n        breakpoint_threshold_type=\"percentile\",\n        breakpoint_threshold_amount=95  # Split when semantic shift exceeds 95th percentile\n    )\n\n    chunks = splitter.split_text(document)\n\n\nThis embeds sentences, calculates cosine distance between adjacent sentence pairs, and splits at significant semantic shifts.\n\n**Even better: store both chunk and parent document**\n\n\n\n    # \"Small-to-big\" or \"Parent Document Retrieval\"\n    # Store small chunks for precise matching\n    # But return the parent document (or larger window) as context\n    # Assumes `vectorstore` is a LangChain vector store you have already initialized\n\n    from langchain.retrievers import ParentDocumentRetriever\n    from langchain.storage import InMemoryStore\n    from langchain_text_splitters import RecursiveCharacterTextSplitter\n\n    child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)\n    parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)\n\n    retriever = ParentDocumentRetriever(\n        vectorstore=vectorstore,\n        docstore=InMemoryStore(),\n        child_splitter=child_splitter,\n        parent_splitter=parent_splitter,\n    )\n\n\nSmall chunks match with high precision. 
The returned context is the larger parent - so your LLM gets enough surrounding information to reason correctly.\n\n## Observability: What You Should Be Logging\n\nIf you're not measuring this stuff, you're flying blind:\n\n\n\n import time\n from dataclasses import dataclass\n from typing import Optional\n\n @dataclass\n class RetrievalTrace:\n query: str\n query_embedding_ms: float\n search_ms: float\n num_results: int\n top_score: float\n bottom_score: float\n score_spread: float # top - bottom; low spread = retrieval is uncertain\n filter_applied: Optional[dict]\n collection_name: str\n\n def traced_search(client, collection_name, query_text, embed_fn, k=5, filter=None):\n t0 = time.perf_counter()\n embedding = embed_fn(query_text)\n embed_ms = (time.perf_counter() - t0) * 1000\n\n t1 = time.perf_counter()\n results = client.search(\n collection_name=collection_name,\n query_vector=embedding,\n limit=k,\n query_filter=filter\n )\n search_ms = (time.perf_counter() - t1) * 1000\n\n scores = [r.score for r in results]\n trace = RetrievalTrace(\n query=query_text,\n query_embedding_ms=embed_ms,\n search_ms=search_ms,\n num_results=len(results),\n top_score=scores[0] if scores else 0,\n bottom_score=scores[-1] if scores else 0,\n score_spread=(scores[0] - scores[-1]) if len(scores) > 1 else 0,\n filter_applied=filter,\n collection_name=collection_name\n )\n\n # Ship to your observability stack (Datadog, Langfuse, custom)\n log_trace(trace)\n return results\n\n\n**What to watch:**\n\n * `score_spread` near 0 means all results look equally similar - the query probably didn't match anything well\n * `top_score` below your threshold (tune per model, but ~0.75 for cosine is a reasonable starting floor) means you're returning noise\n * Embedding latency spikes often precede throttling errors from your embedding provider\n\n\n\n## The Stack Decision\n\nQuick opinionated guide for 2026:\n\nScenario | Recommendation\n---|---\nPrototype / hobby | ChromaDB (in-process, zero 
infra)\nProduction, self-hosted | Qdrant (best performance, Rust core, Docker-native)\nAlready on Postgres | pgvector + pgvectorscale\nEnterprise, managed | Pinecone or Weaviate Cloud\nNeed multimodal (text + image) | Weaviate or Milvus\nMassive scale (100M+ vectors) | Milvus or Pinecone\n\nDon't use a vector DB for everything. If your corpus is under ~10,000 documents, cosine search over an in-memory numpy array with `np.dot` is fast enough and eliminates an entire infrastructure dependency.\n\n\n\n import numpy as np\n\n corpus_embeddings = np.load(\"embeddings.npy\") # shape: (N, 1536)\n query_embedding = np.array(embed(query)) # shape: (1536,)\n\n # Cosine similarity (assuming normalized vectors)\n scores = corpus_embeddings @ query_embedding\n top_k_indices = np.argsort(scores)[::-1][:10]\n\n\nNo database. No network calls. No ops burden. Just math.\n\n## What This Means for Your RAG Pipeline\n\nPull all of this together and you get a mental model for diagnosing RAG failures:\n\n 1. **LLM gives wrong answer despite having the right docs?** → Generation problem, not retrieval\n 2. **Right docs never appear in retrieved context?** → Check recall, check chunking, check distance metric\n 3. **Results feel semantically correct but factually off?** → Your chunks are too large; precision is suffering\n 4. **Exact terms missing from results?** → You need hybrid search\n 5. **Multi-tenant data leaking across users?** → Your metadata filter is wrong or not indexed\n 6. **Works in dev, breaks in prod?** → Data distribution shift. Retrain/rebuild index or tune `ef`/`nprobe`\n\n\n\nVector databases are not magic retrieval oracles. They're approximate spatial indexes with a product wrapper. Once you understand the approximation, the trade-offs, and the failure modes - you can actually build reliable systems with them.\n\n_If this was useful, I write about Python backend and AI engineering on dev.to. The good stuff is in the details._",
"title": "Vector Databases Are Not Magic, Here's What's Actually Happening Under the Hood"
}