{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreifun3fzrswvu64u4rvcvvzj3ovoswh2zxxfmkbxxf2fxzfhhqnpbu",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mhsij2xfrlm2"
},
"path": "/t/faiss-lmdb-rag-on-a-50-year-corpus-works-great-until-you-ask-what-happened-in-2020-time-aware-retrieval-problem/174583#post_1",
"publishedAt": "2026-03-24T09:00:06.000Z",
"site": "https://discuss.huggingface.co",
"textContent": "I’m working on a RAG system over a long-span archive (~50 years), and the current retrieval stack performs well for general semantic queries.\n\nHowever, I’m struggling with **time-constrained queries** where users implicitly or explicitly expect results from a specific period.\n\nExample query:\n\n_“What happened to XYZ political party during the 2020 election?”_\n\nThe system retrieves semantically relevant content about the entity, but **fails to prioritize results within the intended time window**, even when increasing K.\n\n**System setup**\n\n**Data**\n\n * Corpus: ~1.6M documents → ~5M vectors after chunking\n\n * Language: Non-English\n\n\n\n\n**Embeddings**\n\n * Model: LaBSE - 768-dim, L2-normalized (cosine / inner product)\n\n * Chunks: cleaned text segments (noise-reduced)\n\n\n\n\n**Index & storage**\n\n * FAISS IVFPQ (primary ANN index)\n\n * Raw vectors stored in memmap (used for exact rerank on candidates)\n\n * Docstore: LMDB (ID → chunk + metadata)\n\n\n\n\n**Retrieval pipeline**\n\nQuery → decomposition → main query\n\n→ LaBSE embedding → normalized vector\n\n→ FAISS IVFPQ → top-K candidate IDs\n\n→ memmap → exact dot-product rerank → top-N\n\n→ LMDB → fetch chunks + metadata\n\n→ cross-encoder reranker → final scoring\n\n**Performance**\n\n * Recall@5 ≈ 80% (acceptable for general queries)\n\n\n\n### **Query decomposition & temporal signals**\n\nStructured signals are extracted from queries, including **timeline** (explicit years, relative dates normalized to ranges) and **main intent**.\n\nEach document chunk also contains **date metadata**, accessible at retrieval time.\n\nHowever, retrieval currently uses a **cleaned entity-focused query**, where timeline cues are intentionally removed to improve semantic matching.\n\nEven though both **query-side time constraints** and **document-side timestamps** are available, they are **not incorporated during candidate generation**, which remains purely semantic.\n\n**What I tried / 
considered**\n\n**1. Increasing K and post-retrieval filtering based on date metadata**\n\n * Issues:\n\n * Results are still diluted across decades\n\n * Not reliable for narrow time windows\n\n * Risk of losing truly relevant docs\n\n\n\n\n**2. Time-based sharding (design idea)**\n\n * Split vector store into **year-wise (or period-wise) shards**\n\n * Route query to relevant shard(s)\n\n * Maintain one **global store** for generic queries\n\n * Issues:\n\n * Requires deeper changes in retrieval + reranking flow\n\n * Operational overhead (multiple indices)\n\n\n\n\n**Questions**\n\n 1. How do production systems typically handle:\n\n * “entity + time window” queries at scale\n\n * without sacrificing recall or blowing up latency?\n\n 2. Is **pre-filtering (via sharding or partitioned indices)** generally preferred over post-filtering for time-constrained queries?\n\n\n",
"title": "FAISS + LMDB RAG on a 50-year corpus works great — until you ask ‘what happened in 2020?’ (time-aware retrieval problem)"
}