Help with a Local Document RAG System (Storage + Ingestion + Query + Highlighting)
Hugging Face Forums [Unofficial]
June 20, 2026
Hey folks,
I’m working on designing a local, offline document retrieval + LLM pipeline and would love your input on the architecture. Here’s what I’m aiming for:
STORAGE
* Upload PDF, DOCX, XLSX, CSV, tables
* All data stored locally (no cloud)
DOCUMENT INGESTION
* Watch folder (e.g., Watchdog) → auto-ingest on file add/modify/delete
* Nested folder structure → auto-tagging
* Supported formats: PDF, scanned PDF, DOCX, XLSX, CSV, JPG/PNG
* Version control on re-upload
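To make the tagging and versioning bullets concrete, here is a rough sketch of what I have in mind. It uses only the standard library and assumes tags are just the folder names between the watch root and the file, and that a "new version" means the file bytes changed. The names `tags_from_path` and `VersionRegistry` are placeholders I made up, and the in-memory dict stands in for a real table. The watcher itself (e.g., Watchdog) would call these from its event handler and is not shown.

```python
import hashlib
from pathlib import Path


def tags_from_path(root: Path, file_path: Path) -> list[str]:
    """Use each folder between the watch root and the file as a tag."""
    return list(file_path.relative_to(root).parent.parts)


def content_hash(data: bytes) -> str:
    """Hash the file bytes so an unchanged re-upload can be detected."""
    return hashlib.sha256(data).hexdigest()


class VersionRegistry:
    """In-memory stand-in for a versions table, keyed by relative path."""

    def __init__(self) -> None:
        self._history: dict[str, list[str]] = {}

    def register(self, rel_path: str, data: bytes) -> int:
        """Return the 1-based version number of this upload.

        Re-uploading bytes identical to the latest version does not
        create a new version; any other content appends one.
        """
        digest = content_hash(data)
        history = self._history.setdefault(rel_path, [])
        if not history or history[-1] != digest:
            history.append(digest)
        return len(history)
```

With this layout, a file at `<root>/client_a/invoices/fy24/inv.pdf` would get the tags `client_a`, `invoices`, `fy24`, and the first folder level could double as the client ID for query scoping.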
QUERY & RETRIEVAL
* Restrict queries to a single client’s documents (no cross-client leakage)
* Structured queries (e.g., “Show invoices > ₹1 lakh”)
* Comparative queries (e.g., “Compare FY23 vs FY24 gross profit”)
* Keyword fallback
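For the structured-query case, my current thinking is that extracted invoice fields go into a plain SQL table and the client filter is applied in the query itself rather than left to the LLM. A minimal sketch with the standard-library `sqlite3` module follows. The table layout, column names, and the `invoices_above` helper are assumptions for illustration, not a settled schema.

```python
import sqlite3

LAKH = 100_000  # 1 lakh = 100,000 rupees


def make_db() -> sqlite3.Connection:
    """Create an in-memory table of extracted invoice fields."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE invoices ("
        " client_id TEXT NOT NULL,"
        " doc_id TEXT NOT NULL,"
        " page INTEGER NOT NULL,"
        " amount_inr REAL NOT NULL)"
    )
    return conn


def invoices_above(conn: sqlite3.Connection, client_id: str, min_amount: float):
    """Return (doc_id, page, amount_inr) rows for one client only.

    The client_id filter is always part of the WHERE clause, so rows
    belonging to other clients can never appear in the result.
    """
    cursor = conn.execute(
        "SELECT doc_id, page, amount_inr FROM invoices"
        " WHERE client_id = ? AND amount_inr > ?"
        " ORDER BY amount_inr DESC",
        (client_id, min_amount),
    )
    return cursor.fetchall()
```

"Show invoices > ₹1 lakh" for one client would then be `invoices_above(conn, "client_a", LAKH)`, and each returned row already carries the doc + page needed for a citation.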
HIGHLIGHTING & RENDERING
* Annotated PDF served to frontend
* XLSX → colored cell export
* Jump directly to highlighted page
* Multi-document highlights in one response
ANSWER GENERATION
* Local LLM only
* Every claim cited with doc + page reference
MY QUESTIONS
1. Parsing: I’m considering LlamaIndex LiteParse.
→ Should I store document IDs + chunk IDs for PDFs to enable highlighting?
2. Vector DB:
* Do I need one (e.g., Qdrant)?
* If yes, how do I store doc IDs + chunk IDs alongside embeddings for highlighting?
* Would pgvector in Postgres be sufficient?
3. GraphRAG:
* How effective are graph-based approaches, such as a Neo4j-backed knowledge graph or Microsoft GraphRAG?
* Can they run locally/offline, or are they too computationally heavy?
* Is this GraphRAG pipeline from LlamaIndex a good starting point?
4. Highlighting UX:
* I want something like Turnitin/iThenticate reports → exact sentence highlighted + citation.
* Any open-source projects that already do this?
* I found Kotaemon and AnythingLLM, which are close but don’t highlight documents.
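To clarify what I mean by storing doc IDs + chunk IDs for highlighting, here is the kind of record I imagine keeping next to each embedding, plus a helper that turns a cited sentence into a page-level character span. This is only a sketch under my own assumptions: the field names are hypothetical, offsets are characters within the extracted page text, and matching is exact-string only. Mapping the span to rectangles on the PDF page would still need a PDF library and is not shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Chunk:
    """Metadata stored alongside an embedding (as payload or extra columns)."""
    doc_id: str
    chunk_id: str
    page: int
    char_start: int  # offset of this chunk within the page's extracted text
    text: str


@dataclass(frozen=True)
class Highlight:
    """A span to highlight, expressed in page-text character offsets."""
    doc_id: str
    chunk_id: str
    page: int
    start: int
    end: int


def locate_sentence(chunk: Chunk, sentence: str) -> Optional[Highlight]:
    """Find an exact sentence inside a chunk and return its page-level span.

    Returns None when the sentence does not occur verbatim in the chunk,
    e.g. because the LLM paraphrased instead of quoting.
    """
    idx = chunk.text.find(sentence)
    if idx == -1:
        return None
    start = chunk.char_start + idx
    return Highlight(chunk.doc_id, chunk.chunk_id, chunk.page, start, start + len(sentence))
```

The same record would work whether the embeddings live in Qdrant or in a pgvector column, since the highlight only depends on the metadata, not on where the vector is stored.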
TL;DR
Trying to build a local RAG system with:
* Storage + ingestion + tagging
* Query + retrieval + highlighting
* Local LLM answer generation with citations
Looking for advice on:
* Vector DB vs pgvector
* GraphRAG feasibility offline
* Best way to implement document highlighting + citation preview
Would love to hear from anyone who’s built something similar or explored these tools.