
Help with a Local Document RAG System (Storage + Ingestion + Query + Highlighting)

Hugging Face Forums [Unofficial] June 21, 2026

With PDFs and XLSX files, a lot of RAG projects get stuck before the actual RAG part starts. As a general architecture, I would think about it like this:

Direct answers

1. Should you store document IDs and chunk IDs?

Yes, but I would treat document_id and chunk_id as only the minimum.

For highlighting, you probably want a more provenance-heavy model:

document_id
document_version_id
chunk_id
span_id
page number
character offsets
bounding boxes
parser element IDs
table IDs
sheet names
cell ranges
source text hash

A chunk is usually a retrieval unit. A highlight needs an evidence unit.

That distinction matters a lot. A chunk may contain the answer, but it may still be too coarse to highlight the exact sentence, paragraph, table cell, or spreadsheet range.
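
To make the retrieval-unit vs. evidence-unit split concrete, here is a minimal sketch of how the two records might look. All class, field, and function names are my own illustration, not taken from any particular library, and the stored hash is just one possible way to notice that saved offsets no longer match the text.

```python
from dataclasses import dataclass, field
from hashlib import sha256
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class EvidenceSpan:
    """Smallest highlightable unit, e.g. one sentence on one page."""
    span_id: str
    document_version_id: str
    page: int
    char_start: int  # offsets into the extracted page text
    char_end: int
    bbox: Optional[Tuple[float, float, float, float]]  # (x0, y0, x1, y1) or None
    text_hash: str  # hash of the exact text the span covered


@dataclass
class Chunk:
    """Retrieval unit: the text that gets embedded and searched."""
    chunk_id: str
    document_id: str
    document_version_id: str
    text: str
    span_ids: List[str] = field(default_factory=list)


def make_span(span_id, version_id, page, page_text, start, end, bbox=None):
    """Build a span and hash its text so stale offsets can be detected later."""
    digest = sha256(page_text[start:end].encode("utf-8")).hexdigest()
    return EvidenceSpan(span_id, version_id, page, start, end, bbox, digest)


def span_still_valid(span, page_text):
    """True if the text at the stored offsets still matches the stored hash."""
    current = page_text[span.char_start:span.char_end].encode("utf-8")
    return sha256(current).hexdigest() == span.text_hash
```

A chunk then only carries a list of span IDs, and the highlighter works from the spans rather than from the chunk text.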

For PDF visual citations, LiteParse is worth looking at because its visual-citation approach focuses on page screenshots and bounding boxes. I would not treat that as “just citation formatting”; it is closer to a document-layout/provenance problem.

2. Do you need a vector database?

For a local MVP, I would probably start with:

Postgres
pgvector
Postgres full-text search
structured extraction tables
optional reranking

The reason is that your system needs more than semantic retrieval. You also need:

client isolation
versioning
file metadata
structured invoice queries
fiscal-period comparisons
extracted table data
audit trails
citation metadata
highlight metadata

Those are relational/database problems before they are vector-search problems.

So I would not make the vector DB the source of truth. I would make the document/version/provenance database the source of truth, and then attach vector search to that.
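
As a rough sketch of what “document/version/provenance database as the source of truth” could mean, the example below uses Python's built-in sqlite3 purely as a stand-in so it runs without a server. The table and column names are hypothetical, and a real Postgres schema would differ in types and constraints.

```python
import sqlite3

SCHEMA = """
CREATE TABLE documents (
    document_id TEXT PRIMARY KEY,
    client_id   TEXT NOT NULL,
    filename    TEXT NOT NULL
);
CREATE TABLE document_versions (
    document_version_id TEXT PRIMARY KEY,
    document_id TEXT NOT NULL REFERENCES documents(document_id),
    version_no  INTEGER NOT NULL,
    source_hash TEXT NOT NULL,
    UNIQUE (document_id, version_no)
);
-- With pgvector, an embedding column of type vector(N) could sit on this table.
CREATE TABLE chunks (
    chunk_id TEXT PRIMARY KEY,
    document_version_id TEXT NOT NULL
        REFERENCES document_versions(document_version_id),
    page_start INTEGER,
    page_end   INTEGER,
    text       TEXT NOT NULL
);
"""


def open_store():
    """Create an empty in-memory store with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def latest_chunks_for_client(conn, client_id):
    """Chunk IDs from the newest version of each document owned by one client."""
    rows = conn.execute(
        """
        SELECT c.chunk_id
        FROM chunks c
        JOIN document_versions v
          ON v.document_version_id = c.document_version_id
        JOIN documents d ON d.document_id = v.document_id
        WHERE d.client_id = ?
          AND v.version_no = (
              SELECT MAX(v2.version_no)
              FROM document_versions v2
              WHERE v2.document_id = v.document_id)
        ORDER BY c.chunk_id
        """,
        (client_id,),
    ).fetchall()
    return [row[0] for row in rows]
```

The point is that client isolation and “current version only” are plain joins and filters here, and vector search becomes one more column or index attached to these rows.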

3. Would pgvector be sufficient?

Probably yes for the first serious local version.

pgvector is a good default if you want a single local system where SQL filters, metadata, document versions, and embeddings live close together.
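
To show what “SQL filters and embeddings close together” looks like, here is a small sketch. The query string uses pgvector's cosine-distance operator `<=>`; the table layout (a `chunks` table with a denormalized `client_id` column) is my assumption. The Python function below it is only an in-memory reference for the same filter-then-rank behaviour, so the example runs without a database.

```python
import math

# Hypothetical table and column names; `<=>` is pgvector's cosine distance.
PGVECTOR_QUERY = """
SELECT chunk_id
FROM chunks
WHERE client_id = %(client_id)s
ORDER BY embedding <=> %(query_embedding)s
LIMIT %(k)s
"""


def cosine_distance(a, b):
    """1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def filtered_search(rows, client_id, query_embedding, k):
    """In-memory equivalent of the query above: filter by client, then rank."""
    candidates = [r for r in rows if r["client_id"] == client_id]
    candidates.sort(key=lambda r: cosine_distance(r["embedding"], query_embedding))
    return [r["chunk_id"] for r in candidates[:k]]
```

Because the client filter is part of the same query, another client's chunks never enter the ranking at all.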

I would move to Qdrant when vector search becomes its own subsystem:

larger collections
heavier vector-search tuning
more demanding payload filtering
separate scaling
separate retrieval infrastructure
need for a dedicated vector-search service

Qdrant is a very reasonable choice, especially for vector search with payload filters. I just would not start there unless you already know Postgres will be the bottleneck.

A simple rule:

Start with Postgres + pgvector if your hardest problems are metadata, versions, permissions, structured queries, and local simplicity. Add Qdrant when vector search itself becomes the hard part.

4. How should doc IDs and chunk IDs be stored with embeddings?

Whichever vector store you use, store the embedding together with enough metadata to get back to the original evidence.

Example payload / metadata:

{
  "client_id": "client_a",
  "document_id": "doc_123",
  "document_version_id": "doc_123_v4",
  "chunk_id": "chunk_0037",
  "page_start": 12,
  "page_end": 13,
  "span_ids": ["span_8801", "span_8802"],
  "parser": "docling_or_liteparse",
  "source_hash": "..."
}

But I would not try to put all visual-highlight data directly inside the vector DB.

Better split:

vector DB / pgvector row: retrieval metadata
relational tables: full provenance, spans, bboxes, cells, tables, versions
file store: original raw documents

The retrieval result should point back to the canonical evidence records.
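
A minimal sketch of that “point back” step, assuming the payload shape from the example above and a hypothetical lookup of span records keyed by span ID (in practice this would be a query against the relational tables):

```python
def resolve_evidence(payload, spans_by_id):
    """Map a retrieval hit's payload to its canonical span records.

    Raises KeyError if a span is missing, and ValueError if a span belongs
    to a different document version than the hit claims.
    """
    resolved = []
    for span_id in payload["span_ids"]:
        span = spans_by_id[span_id]
        if span["document_version_id"] != payload["document_version_id"]:
            raise ValueError(f"{span_id} is from another document version")
        resolved.append(span)
    return resolved
```

Failing loudly on a version mismatch is a design choice: for audit-style use I would rather show no highlight than highlight text from an outdated version.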

5. What about GraphRAG?

GraphRAG can be useful, but I would not make it the first layer.

I would use GraphRAG later for questions like:

“Which companies, people, contracts, and obligations are connected?”
“What are the recurring themes across this corpus?”
“Which documents refer to the same entity or obligation?”
“Summarize relationships across many documents.”

I would not use it as the first tool for:

“Show invoices > ₹1 lakh.”
“Compare FY23 vs FY24 gross profit.”
“Search only this client’s documents.”
“Find this spreadsheet value.”

Those are structured extraction + SQL problems first.
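
For example, “Show invoices > ₹1 lakh” for one client could be a plain query over an extracted-invoices table. The sketch below again uses sqlite3 only as a runnable stand-in for Postgres; the table, columns, and function names are hypothetical, and amounts are assumed to be stored as whole rupees.

```python
import sqlite3


def open_invoice_store():
    """Create an empty in-memory table for extracted invoice rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE invoices (
            invoice_no     TEXT PRIMARY KEY,
            client_id      TEXT NOT NULL,
            fiscal_year    TEXT NOT NULL,
            amount_inr     INTEGER NOT NULL,
            source_span_id TEXT NOT NULL  -- link back to the evidence record
        )
        """
    )
    return conn


def invoices_over(conn, client_id, min_amount_inr):
    """Invoices for one client strictly above a threshold, largest first."""
    return conn.execute(
        "SELECT invoice_no, amount_inr, source_span_id FROM invoices "
        "WHERE client_id = ? AND amount_inr > ? "
        "ORDER BY amount_inr DESC",
        (client_id, min_amount_inr),
    ).fetchall()
```

Calling `invoices_over(conn, "client_a", 100_000)` answers the question exactly, and each row still carries a span ID that can be used for highlighting.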

Microsoft GraphRAG and Neo4j GraphRAG are good references, but I would add that layer only after the simpler system shows which relationship or global-summary questions it cannot answer.

6. How should highlighting work?

I would separate citation from visual highlighting.

A citation can be:

document title
page number
chunk ID
quote
source link

A visual highlight needs more:

page coordinates
character offsets
bounding boxes
table cell coordinates
spreadsheet ranges
rendering/export logic

For PDF, you probably want one or both of:

PDF viewer overlay, for example with PDF.js
annotated PDF export, for example with PyMuPDF annotations
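
For the viewer-overlay route, most of the work is coordinate conversion. Here is a minimal sketch, assuming bounding boxes are stored in PDF points with a bottom-left origin and the viewer draws the page at a uniform scale with a top-left origin. It ignores page rotation and crop-box offsets, and some parsers already report top-left coordinates, in which case the vertical flip is not needed.

```python
def pdf_bbox_to_overlay(bbox, page_height_pt, scale):
    """Convert (x0, y0, x1, y1) in PDF points (bottom-left origin)
    to a pixel box with a top-left origin, suitable for an overlay."""
    x0, y0, x1, y1 = bbox
    left, right = sorted((x0, x1))
    bottom, top = sorted((y0, y1))
    return {
        "left": left * scale,
        "top": (page_height_pt - top) * scale,
        "width": (right - left) * scale,
        "height": (top - bottom) * scale,
    }
```

The resulting box can be positioned as an absolutely placed element on top of the rendered page.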

For XLSX, the evidence unit is usually not a bounding box. It is more like:

workbook_version_id
sheet_name
cell_address
cell_range
table_id
row_header_context
column_header_context
displayed_value
formula

Then you can generate a colored spreadsheet copy with a library such as openpyxl.
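
Before coloring anything, a stored cell_range has to be expanded into individual cell addresses. The helper below is a standard-library-only sketch with hypothetical function names; it handles plain A1-style references such as "B2:C3" and does not handle sheet prefixes, absolute markers ($), or whole-row/column ranges.

```python
import re

_CELL = re.compile(r"^([A-Z]+)([1-9][0-9]*)$")


def col_to_index(letters):
    """'A' -> 1, 'Z' -> 26, 'AA' -> 27."""
    n = 0
    for ch in letters:
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n


def index_to_col(n):
    """1 -> 'A', 26 -> 'Z', 27 -> 'AA'."""
    letters = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def expand_range(cell_range):
    """Expand 'B2:C3' (or a single cell like 'B2') into row-major addresses."""
    start, _, end = cell_range.partition(":")
    end = end or start
    m0, m1 = _CELL.match(start), _CELL.match(end)
    if not (m0 and m1):
        raise ValueError(f"unsupported cell range: {cell_range!r}")
    c0, c1 = sorted((col_to_index(m0.group(1)), col_to_index(m1.group(1))))
    r0, r1 = sorted((int(m0.group(2)), int(m1.group(2))))
    return [
        f"{index_to_col(c)}{r}"
        for r in range(r0, r1 + 1)
        for c in range(c0, c1 + 1)
    ]
```

Each returned address, together with the sheet_name from the evidence record, identifies one cell to fill in the highlighted copy.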

7. Why existing tools may feel close but not enough

Many local RAG apps optimize for “chat with files.”

Your target sounds closer to “audit-grade document evidence.”

That is a different problem.

Existing tools often cite a chunk, a page, or a source node. You want to jump to the exact sentence, bounding box, table cell, or spreadsheet range. That is a stricter requirement.

This is probably why tools like Kotaemon / AnythingLLM / RAGFlow can feel close but still not fully match your UX target. I would study them as design references, but I would expect some custom provenance and rendering work.

Kotaemon is especially worth studying for citation/document-preview patterns. RAGFlow DeepDoc is worth studying for document parsing/OCR/layout recognition ideas. I would still design your own canonical provenance layer if you need strict PDF/XLSX highlighting and client isolation.
