Help with a Local Document RAG System (Storage + Ingestion + Query + Highlighting)
Hmm… with PDFs and XLSX files, a lot of RAG projects get stuck before the actual RAG part starts. Anyway, as a general architecture, I would think about it like this:
Direct answers
1. Should you store document IDs and chunk IDs?
Yes, but I would treat document_id and chunk_id as only the minimum.
For highlighting, you probably want a more provenance-heavy model:
- document_id
- document_version_id
- chunk_id
- span_id
- page number
- character offsets
- bounding boxes
- parser element IDs
- table IDs
- sheet names
- cell ranges
- source text hash
A chunk is usually a retrieval unit. A highlight needs an evidence unit.
That distinction matters a lot. A chunk may contain the answer, but it may still be too coarse to highlight the exact sentence, paragraph, table cell, or spreadsheet range.
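To make the chunk vs. evidence-unit split concrete, here is a minimal sketch of how the two records could be modeled. The class names, field names, and sample values are my own assumptions for illustration, not a schema from any particular tool.

```python
from dataclasses import dataclass
from hashlib import sha256
from typing import Optional, Tuple


@dataclass(frozen=True)
class EvidenceSpan:
    """Smallest highlightable unit (hypothetical schema)."""
    span_id: str
    document_version_id: str
    text: str
    page: Optional[int] = None          # PDF page, if the source is a PDF
    char_start: Optional[int] = None    # offsets into the extracted page text
    char_end: Optional[int] = None
    bbox: Optional[Tuple[float, float, float, float]] = None  # x0, y0, x1, y1
    sheet_name: Optional[str] = None    # XLSX only
    cell_range: Optional[str] = None    # XLSX only, e.g. "B2:C3"

    @property
    def source_hash(self) -> str:
        # Hash of the extracted text, to detect stale highlights after re-parsing.
        return sha256(self.text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Chunk:
    """Retrieval unit: what gets embedded. Points at spans, does not copy them."""
    chunk_id: str
    document_id: str
    document_version_id: str
    span_ids: Tuple[str, ...]


# Made-up sample data.
span = EvidenceSpan("span_8801", "doc_123_v4", "Total due: 1,20,000",
                    page=12, char_start=340, char_end=359)
chunk = Chunk("chunk_0037", "doc_123", "doc_123_v4", ("span_8801",))
```

The chunk only carries span IDs, so you can re-chunk or re-embed later without touching the evidence records.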
For PDF visual citations, LiteParse visual citations are worth looking at because they focus on page screenshots and bounding boxes. I would not treat that as “just citation formatting”; it is closer to a document-layout/provenance problem.
2. Do you need a vector database?
For a local MVP, I would probably start with:
- Postgres
- pgvector
- Postgres full-text search
- structured extraction tables
- optional reranking
The reason is that your system is not only semantic retrieval. You also need:
- client isolation
- versioning
- file metadata
- structured invoice queries
- fiscal-period comparisons
- extracted table data
- audit trails
- citation metadata
- highlight metadata
Those are relational/database problems before they are vector-search problems.
So I would not make the vector DB the source of truth. I would make the document/version/provenance database the source of truth, and then attach vector search to that.
3. Would pgvector be sufficient?
Probably yes for the first serious local version.
pgvector is a good default if you want a single local system where SQL filters, metadata, document versions, and embeddings live close together.
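As a rough illustration of "filters and embeddings live close together", the sketch below does the same thing in plain Python: filter by client first, then rank by cosine distance. In pgvector this would be a WHERE clause plus an ORDER BY on the cosine-distance operator (`<=>`); the row layout, function names, and toy vectors here are assumptions.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def search(rows, query_vec, client_id, k=3):
    """Apply the metadata filter first, then rank the survivors by distance."""
    candidates = [r for r in rows if r["client_id"] == client_id]
    candidates.sort(key=lambda r: cosine_distance(r["embedding"], query_vec))
    return [r["chunk_id"] for r in candidates[:k]]


# Toy two-dimensional embeddings, made up for the example.
rows = [
    {"chunk_id": "chunk_0001", "client_id": "client_a", "embedding": [1.0, 0.0]},
    {"chunk_id": "chunk_0002", "client_id": "client_a", "embedding": [0.0, 1.0]},
    {"chunk_id": "chunk_0003", "client_id": "client_b", "embedding": [1.0, 0.0]},
]
top_chunks = search(rows, [0.9, 0.1], client_id="client_a", k=2)
```

Note that client_b's chunk never enters the ranking, even though its vector is an exact match. That is the client-isolation behavior you want from the filter.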
I would move to Qdrant when vector search becomes its own subsystem:
- larger collections
- heavier vector-search tuning
- more demanding payload filtering
- separate scaling
- separate retrieval infrastructure
- need for a dedicated vector-search service
Qdrant is a very reasonable choice, especially for vector search with payload filters. I just would not start there unless you already know Postgres will be the bottleneck.
A simple rule:
Start with Postgres + pgvector if your hardest problems are metadata, versions, permissions, structured queries, and local simplicity. Add Qdrant when vector search itself becomes the hard part.
4. How should doc IDs and chunk IDs be stored with embeddings?
Whichever vector store you use, store the embedding together with enough metadata to get back to the original evidence.
Example payload / metadata:
{
"client_id": "client_a",
"document_id": "doc_123",
"document_version_id": "doc_123_v4",
"chunk_id": "chunk_0037",
"page_start": 12,
"page_end": 13,
"span_ids": ["span_8801", "span_8802"],
"parser": "docling_or_liteparse",
"source_hash": "..."
}
But I would not try to put all visual-highlight data directly inside the vector DB.
Better split:
- vector DB / pgvector row: retrieval metadata
- relational tables: full provenance, spans, bboxes, cells, tables, versions
- file store: original raw documents
The retrieval result should point back to the canonical evidence records.
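Here is a small sketch of that "point back" step, using an in-memory SQLite table as a stand-in for the relational provenance store. The table layout, the `resolve_evidence` helper, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE spans (
        span_id TEXT PRIMARY KEY,
        document_version_id TEXT NOT NULL,
        page INTEGER,
        x0 REAL, y0 REAL, x1 REAL, y1 REAL,
        text TEXT NOT NULL
    )
""")
# Made-up sample rows.
conn.executemany(
    "INSERT INTO spans VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("span_8801", "doc_123_v4", 12, 72.0, 640.0, 300.0, 655.0,
         "Payment is due within 30 days."),
        ("span_8802", "doc_123_v4", 13, 72.0, 700.0, 280.0, 715.0,
         "Late fees apply after that."),
    ],
)


def resolve_evidence(conn, hit):
    """Turn a retrieval hit (payload only) into full span records."""
    span_ids = hit["span_ids"]
    if not span_ids:
        return []
    placeholders = ",".join("?" for _ in span_ids)
    sql = (
        "SELECT span_id, page, x0, y0, x1, y1, text FROM spans "
        "WHERE document_version_id = ? "
        f"AND span_id IN ({placeholders}) ORDER BY page, span_id"
    )
    return conn.execute(sql, [hit["document_version_id"], *span_ids]).fetchall()


hit = {
    "chunk_id": "chunk_0037",
    "document_version_id": "doc_123_v4",
    "span_ids": ["span_8801", "span_8802"],
}
evidence = resolve_evidence(conn, hit)
```

Filtering on document_version_id as well as span_id means a hit from an old index cannot silently resolve against a newer version of the file.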
5. What about GraphRAG?
GraphRAG can be useful, but I would not make it the first layer.
I would use GraphRAG later for questions like:
- “Which companies, people, contracts, and obligations are connected?”
- “What are the recurring themes across this corpus?”
- “Which documents refer to the same entity or obligation?”
- “Summarize relationships across many documents.”
I would not use it as the first tool for:
- “Show invoices > ₹1 lakh.”
- “Compare FY23 vs FY24 gross profit.”
- “Search only this client’s documents.”
- “Find this spreadsheet value.”
Those are structured extraction + SQL problems first.
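For example, once invoice fields are extracted into a table, the first and third questions become plain SQL. The sketch below uses SQLite only to stay self-contained; the table, column names, and amounts are invented, and the per-year totals are just a stand-in for a real fiscal-period comparison.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE invoices (
        invoice_id TEXT PRIMARY KEY,
        client_id TEXT NOT NULL,
        document_version_id TEXT NOT NULL,  -- link back to the source file
        fiscal_year TEXT NOT NULL,
        amount_inr INTEGER NOT NULL
    )
""")
# Made-up sample rows.
db.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?, ?, ?)",
    [
        ("inv_001", "client_a", "doc_201_v1", "FY23", 85_000),
        ("inv_002", "client_a", "doc_202_v1", "FY24", 150_000),
        ("inv_003", "client_b", "doc_301_v1", "FY24", 240_000),
    ],
)

ONE_LAKH = 100_000

# "Show invoices > 1 lakh", restricted to one client.
large_invoices = db.execute(
    "SELECT invoice_id, amount_inr FROM invoices "
    "WHERE client_id = ? AND amount_inr > ? ORDER BY amount_inr DESC",
    ("client_a", ONE_LAKH),
).fetchall()

# Per-fiscal-year totals for the same client.
totals_by_year = dict(db.execute(
    "SELECT fiscal_year, SUM(amount_inr) FROM invoices "
    "WHERE client_id = ? GROUP BY fiscal_year",
    ("client_a",),
).fetchall())
```

No embedding or graph is involved here. Each row still carries a document_version_id, so the answer can be traced back to the source document.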
Microsoft GraphRAG and Neo4j GraphRAG are good references, but I would add that layer only after the simpler system shows what relation/global-summary questions it cannot answer.
6. How should highlighting work?
I would separate citation from visual highlighting.
A citation can be:
- document title
- page number
- chunk ID
- quote
- source link
A visual highlight needs more:
- page coordinates
- character offsets
- bounding boxes
- table cell coordinates
- spreadsheet ranges
- rendering/export logic
For PDF, you probably want one or both of:
- PDF viewer overlay, for example with PDF.js
- annotated PDF export, for example with PyMuPDF annotations
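For the viewer-overlay route, most of the work is coordinate conversion. This sketch assumes the stored bbox is in PDF points with a bottom-left origin and that the viewer wants pixel rectangles with a top-left origin. It ignores page rotation and crop-box offsets, and the function name is my own. If your parser already emits top-left coordinates, only the scaling part applies.

```python
def pdf_bbox_to_viewport(bbox, page_height_pt, scale):
    """Convert (x0, y0, x1, y1) in PDF points (origin bottom-left)
    to (left, top, width, height) in pixels (origin top-left)."""
    x0, y0, x1, y1 = bbox
    left = min(x0, x1) * scale
    top = (page_height_pt - max(y0, y1)) * scale  # flip the y axis
    width = abs(x1 - x0) * scale
    height = abs(y1 - y0) * scale
    return left, top, width, height


# US Letter page (792 pt tall) rendered at 1.5x zoom; values are made up.
overlay = pdf_bbox_to_viewport((72.0, 640.0, 300.0, 655.0),
                               page_height_pt=792.0, scale=1.5)
```

The resulting rectangle can be used to position an absolutely placed highlight element on top of the rendered page.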
For XLSX, the evidence unit is usually not a bounding box. It is more like:
- workbook_version_id
- sheet_name
- cell_address
- cell_range
- table_id
- row_header_context
- column_header_context
- displayed_value
- formula
Then you can generate a colored spreadsheet copy with a library such as openpyxl.
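A minimal sketch of that coloring step with openpyxl, assuming the evidence record gives you a sheet name and a cell range. The `highlight_range` helper, the yellow fill, and the sample workbook are my own choices for illustration.

```python
from openpyxl import Workbook
from openpyxl.styles import PatternFill
from openpyxl.utils import range_boundaries

HIGHLIGHT = PatternFill(fill_type="solid", fgColor="FFFF00")  # solid yellow


def highlight_range(workbook, sheet_name, cell_range):
    """Fill every cell in cell_range (e.g. "B2" or "B2:C3") on the given sheet."""
    ws = workbook[sheet_name]
    min_col, min_row, max_col, max_row = range_boundaries(cell_range)
    for row in ws.iter_rows(min_row=min_row, max_row=max_row,
                            min_col=min_col, max_col=max_col):
        for cell in row:
            cell.fill = HIGHLIGHT


# Build a tiny made-up workbook in memory instead of loading a real file.
wb = Workbook()
ws = wb.active
ws.title = "Summary"
ws["A1"] = "Item"
ws["B1"] = "FY24"
ws["A2"] = "Gross profit"
ws["B2"] = 1250000

highlight_range(wb, "Summary", "B2")
# wb.save("highlighted_copy.xlsx")  # write the copy; keep the original untouched
```

In a real pipeline you would open a copy of the stored workbook with `openpyxl.load_workbook`, apply the highlights, and save under a new name so the original file stays the canonical version.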
7. Why existing tools may feel close but not enough
Many local RAG apps optimize for “chat with files.”
Your target sounds closer to “audit-grade document evidence.”
That is a different problem.
Existing tools often cite a chunk, a page, or a source node. You want to jump to the exact sentence, bounding box, table cell, or spreadsheet range. That is a stricter requirement.
This is probably why tools like Kotaemon / AnythingLLM / RAGFlow can feel close but still not fully match your UX target. I would study them as design references, but I would expect some custom provenance and rendering work.
Kotaemon is especially worth studying for citation/document-preview patterns. RAGFlow DeepDoc is worth studying for document parsing/OCR/layout recognition ideas. I would still design your own canonical provenance layer if you need strict PDF/XLSX highlighting and client isolation.