Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreifun3fzrswvu64u4rvcvvzj3ovoswh2zxxfmkbxxf2fxzfhhqnpbu",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mhsij2xfrlm2"
  },
  "path": "/t/faiss-lmdb-rag-on-a-50-year-corpus-works-great-until-you-ask-what-happened-in-2020-time-aware-retrieval-problem/174583#post_1",
  "publishedAt": "2026-03-24T09:00:06.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "I’m working on a RAG system over a long-span archive (~50 years), and the current retrieval stack performs well for general semantic queries.\n\nHowever, I’m struggling with **time-constrained queries** where users implicitly or explicitly expect results from a specific period.\n\nExample query:\n\n_“What happened to XYZ political party during the 2020 election?”_\n\nThe system retrieves semantically relevant content about the entity, but **fails to prioritize results within the intended time window,** even when increasing K.\n\n**System setup**\n\n**Data**\n\n  * Corpus: ~1.6M documents → ~5M vectors after chunking\n\n  * Language: Non-English\n\n\n\n\n**Embeddings**\n\n  * Model: LaBSE - 768-dim, L2-normalized (cosine / inner product)\n\n  * Chunks: cleaned text segments (noise-reduced)\n\n\n\n\n**Index & storage**\n\n  * FAISS IVFPQ (primary ANN index)\n\n  * Raw vectors stored in memmap (used for exact rerank on candidates)\n\n  * Docstore: LMDB (ID → chunk + metadata)\n\n\n\n\n**Retrieval pipeline**\n\nQuery → decomposition → main query\n\n→ LaBSE embedding → normalized vector\n\n→ FAISS IVFPQ → top-K candidate IDs\n\n→ memmap → exact dot-product rerank → top-N\n\n→ LMDB → fetch chunks + metadata\n\n→ cross-encoder reranker → final scoring\n\n**Performance**\n\n  * Recall@5 ≈ 80% (acceptable for general queries)\n\n\n\n### **Query decomposition & temporal signals**\n\nStructured signals are extracted from queries, including **timeline** (explicit years, relative dates normalized to ranges) and **main intent**.\n\nEach document chunk also contains **date metadata** , accessible at retrieval time.\n\nHowever, retrieval currently uses a **cleaned entity-focused query** , where timeline cues are intentionally removed to improve semantic matching.\n\nEven though both **query-side time constraints** and **document-side timestamps** are available, they are **not incorporated during candidate generation** , which remains purely semantic.\n\n**What I 
tried / considered**\n\n**1. Increasing K and post-retrieval filtering based on date metadata**\n\n  * Issues:\n\n    * Results are still diluted across decades\n\n    * Not reliable for narrow time windows\n\n    * Risk of losing truly relevant docs\n\n\n\n\n**2. Time-based sharding (design idea)**\n\n  * Split vector store into **year-wise (or period-wise) shards**\n\n  * Route query to relevant shard(s)\n\n  * Maintain one **global store** for generic queries\n\nIssues:\n\n  * Requires deeper changes in retrieval + reranking flow\n\n  * Operational overhead (multiple indices)\n\n\n\n\n**Questions**\n\n  1. How do production systems typically handle:\n\n     * “entity + time window” queries at scale\n\n     * without sacrificing recall or blowing up latency?\n\n  2. Is **pre-filtering (via sharding or partitioned indices)** generally preferred over post-filtering for time-constrained queries?\n\n\n",
  "title": "FAISS + LMDB RAG on a 50-year corpus works great — until you ask ‘what happened in 2020?’ (time-aware retrieval problem)"
}
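
The post-retrieval filtering option described in the post's `textContent` can be sketched roughly as follows. This is a minimal, stdlib-only illustration, not the author's implementation: the function name `time_aware_rerank`, the `(chunk_id, semantic_score, doc_date)` tuple layout, and the sample data are all hypothetical. It assumes candidates have already come back from the ANN and exact-rerank steps with their date metadata attached, and that the query's timeline has been normalized to a `(start, end)` range. In-window chunks are ranked first, and out-of-window chunks are only used as backfill, so a narrow window does not return an empty result.

```python
from datetime import date

def time_aware_rerank(candidates, window, top_n=5):
    """Rank candidates inside the time window ahead of those outside it.

    candidates: list of (chunk_id, semantic_score, doc_date) tuples (hypothetical layout)
    window:     (start, end) dates, inclusive, extracted from the query
    """
    start, end = window
    in_window = [c for c in candidates if start <= c[2] <= end]
    out_window = [c for c in candidates if not (start <= c[2] <= end)]

    # Within each group, keep the original semantic ordering (higher score first).
    ranked = sorted(in_window, key=lambda c: c[1], reverse=True)
    # Out-of-window chunks only backfill when the window has too few hits.
    ranked += sorted(out_window, key=lambda c: c[1], reverse=True)
    return ranked[:top_n]

# Hypothetical candidates for a query about the 2020 election.
sample = [
    ("a", 0.91, date(1998, 5, 1)),
    ("b", 0.80, date(2020, 11, 3)),
    ("c", 0.85, date(2020, 3, 15)),
    ("d", 0.88, date(2011, 1, 1)),
]
window_2020 = (date(2020, 1, 1), date(2020, 12, 31))
top = time_aware_rerank(sample, window_2020, top_n=3)
print([chunk_id for chunk_id, _, _ in top])
```

A filter like this can only reorder what candidate generation returned. If the top-K from the ANN index contains few or no in-window chunks, which is the dilution problem the post describes, the window has to be applied before or during the vector search instead.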