
FAISS + LMDB RAG on a 50-year corpus works great — until you ask ‘what happened in 2020?’ (time-aware retrieval problem)

Hugging Face Forums [Unofficial] March 24, 2026

I’m working on a RAG system over a long-span archive (~50 years), and the current retrieval stack performs well for general semantic queries.

However, I’m struggling with time-constrained queries where users implicitly or explicitly expect results from a specific period.

Example query:

“What happened to XYZ political party during the 2020 election?”

The system retrieves semantically relevant content about the entity, but fails to prioritize results within the intended time window, even when K is increased.

System setup

Data

  • Corpus: ~1.6M documents → ~5M vectors after chunking

  • Language: Non-English

Embeddings

  • Model: LaBSE - 768-dim, L2-normalized (cosine / inner product)

  • Chunks: cleaned text segments (noise-reduced)

Index & storage

  • FAISS IVFPQ (primary ANN index)

  • Raw vectors stored in memmap (used for exact rerank on candidates)

  • Docstore: LMDB (ID → chunk + metadata)

Retrieval pipeline

Query → decomposition → main query

→ LaBSE embedding → normalized vector

→ FAISS IVFPQ → top-K candidate IDs

→ memmap → exact dot-product rerank → top-N

→ LMDB → fetch chunks + metadata

→ cross-encoder reranker → final scoring
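The exact-rerank stage over the ANN candidates can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual implementation: the file name, the array sizes, the random vectors, and the `exact_rerank` helper are all hypothetical, and the candidate IDs stand in for what FAISS would return.

```python
import os
import tempfile

import numpy as np

DIM = 768  # LaBSE embedding size


def exact_rerank(query_vec, candidate_ids, vectors, top_n):
    """Re-score ANN candidates with exact dot products and keep the best top_n."""
    ids = np.asarray(candidate_ids, dtype=np.int64)
    # Indexing the memmap with an ID array reads only the requested rows.
    cand = np.asarray(vectors[ids], dtype=np.float32)
    # For L2-normalized vectors the dot product equals cosine similarity.
    scores = cand @ query_vec
    order = np.argsort(-scores)[:top_n]
    return ids[order], scores[order]


# Toy stand-in for the raw-vector file (random unit vectors, illustrative only).
rng = np.random.default_rng(0)
raw = rng.standard_normal((1000, DIM)).astype(np.float32)
raw /= np.linalg.norm(raw, axis=1, keepdims=True)
path = os.path.join(tempfile.mkdtemp(), "vectors.f32")
raw.tofile(path)

vectors = np.memmap(path, dtype=np.float32, mode="r", shape=(1000, DIM))
query = raw[42]                      # pretend the query embeds exactly like chunk 42
candidates = [7, 42, 311, 560, 999]  # pretend these IDs came back from FAISS
top_ids, top_scores = exact_rerank(query, candidates, vectors, top_n=3)
```

Note that nothing in this stage looks at dates: a chunk from any decade can win as long as its vector is close to the query.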

Performance

  • Recall@5 ≈ 80% (acceptable for general queries)

Query decomposition & temporal signals

Structured signals are extracted from queries, including timeline (explicit years, relative dates normalized to ranges) and main intent.
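As an illustration of what "normalized to ranges" means here, a minimal sketch of timeline extraction might look like this. It is deliberately simplified and makes several assumptions: the patterns are English (the real corpus is non-English), only explicit four-digit years and a "last N years" phrase are handled, and `extract_timeline` is a hypothetical helper name.

```python
import re
from datetime import date

YEAR = re.compile(r"\b(?:19|20)\d{2}\b")
LAST_N_YEARS = re.compile(r"\blast (\d+) years?\b", re.IGNORECASE)


def extract_timeline(query, today):
    """Return (start, end, cleaned_query); start and end are None without a time cue."""
    years = [int(m.group(0)) for m in YEAR.finditer(query)]
    relative = LAST_N_YEARS.search(query)
    if years:
        # Explicit years: span from the earliest to the latest year mentioned.
        start, end = date(min(years), 1, 1), date(max(years), 12, 31)
        cleaned = YEAR.sub("", query)
    elif relative:
        # Relative cue: widen to the start of the year N years back.
        n = int(relative.group(1))
        start, end = date(today.year - n, 1, 1), today
        cleaned = LAST_N_YEARS.sub("", query)
    else:
        start, end, cleaned = None, None, query
    # Collapse the whitespace left behind by the removed time cue.
    return start, end, " ".join(cleaned.split())


start, end, cleaned = extract_timeline(
    "What happened to XYZ political party during the 2020 election?",
    today=date(2026, 3, 24),
)
rel_start, rel_end, _ = extract_timeline(
    "XYZ party news from the last 5 years", today=date(2026, 3, 24)
)
```

The first call yields the range 2020-01-01 to 2020-12-31 plus a query with the year stripped out.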

Each document chunk also contains date metadata, accessible at retrieval time.

However, retrieval currently uses a cleaned entity-focused query, where timeline cues are intentionally removed to improve semantic matching.

Even though both query-side time constraints and document-side timestamps are available, they are not incorporated during candidate generation, which remains purely semantic.
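To make the gap concrete: because candidate generation ignores time, the only place the two signals can currently meet is after the candidates come back. A minimal sketch of that date check is below. The `Candidate` type, the scores, and the dates are made up for illustration and are not taken from the real system.

```python
from datetime import date
from typing import NamedTuple


class Candidate(NamedTuple):
    chunk_id: int
    score: float    # semantic similarity from the retriever
    doc_date: date  # date metadata stored with the chunk


def filter_by_window(candidates, start, end):
    """Keep only candidates whose document date falls inside [start, end]."""
    return [c for c in candidates if start <= c.doc_date <= end]


# Purely semantic top-K for an entity query: strong matches, but spread over decades.
candidates = [
    Candidate(1, 0.91, date(1998, 5, 2)),
    Candidate(2, 0.90, date(2009, 11, 17)),
    Candidate(3, 0.88, date(2020, 10, 30)),
    Candidate(4, 0.87, date(1987, 3, 9)),
    Candidate(5, 0.86, date(2014, 4, 21)),
    Candidate(6, 0.85, date(2023, 1, 12)),
]

in_window = filter_by_window(candidates, date(2020, 1, 1), date(2020, 12, 31))
kept_ids = [c.chunk_id for c in in_window]
```

In this toy data only one of six candidates survives the 2020 window. Raising K adds more candidates, but they are drawn from the same decade-wide distribution.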

What I tried / considered

1. Increasing K, then post-retrieval filtering by date

  • Issues:

    • Candidates are still diluted across decades

    • Not reliable for narrow time windows

    • Risk of losing truly relevant docs

2. Time-based sharding (design idea)

  • Split vector store into year-wise (or period-wise) shards

  • Route query to relevant shard(s)

  • Maintain one global store for generic queries

  • Issues:

    • Requires deeper changes in retrieval + reranking flow

    • Operational overhead (multiple indices)
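For the sharding idea, the routing and merge logic could look roughly like the sketch below. Everything here is an assumption for illustration: the shard naming (`year_2020`), the `global` fallback name, the year range, and the hit tuples. The per-shard search itself is left out.

```python
from datetime import date


def route_shards(start, end, shard_years, global_name="global"):
    """Pick the year shards overlapping [start, end]; fall back to the global store."""
    if start is None or end is None:
        return [global_name]  # generic query without a time constraint
    hits = [f"year_{y}" for y in shard_years if start.year <= y <= end.year]
    return hits or [global_name]


def merge_results(per_shard, top_k):
    """Merge (chunk_id, score) hits from several shards and keep the best top_k."""
    merged = [hit for hits in per_shard.values() for hit in hits]
    merged.sort(key=lambda hit: hit[1], reverse=True)
    return merged[:top_k]


shard_years = range(1975, 2027)  # assumed span for a ~50-year corpus

routed = route_shards(date(2019, 11, 1), date(2020, 3, 1), shard_years)
generic = route_shards(None, None, shard_years)

# Hypothetical hits returned by searching each routed shard separately.
per_shard = {
    "year_2019": [(101, 0.71), (102, 0.64)],
    "year_2020": [(201, 0.83), (202, 0.69)],
}
merged = merge_results(per_shard, top_k=3)
```

One caveat with this kind of merge: approximate scores from separately trained indices may not be directly comparable, so the exact dot-product rerank would probably need to stay in place after the merge. That is part of the "deeper changes" concern above.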

Question

  1. How do production systems typically handle:

    • “entity + time window” queries at scale

    • without sacrificing recall or blowing up latency?

  2. Is pre-filtering (via sharding or partitioned indices) generally preferred over post-filtering for time-constrained queries?
