
Help with a Local Document RAG System (Storage + Ingestion + Query + Highlighting)

Hugging Face Forums [Unofficial] June 20, 2026

Hey folks,

I’m working on designing a local, offline document retrieval + LLM pipeline and would love your input on the architecture. Here’s what I’m aiming for:

STORAGE
* Upload PDF, DOCX, XLSX, CSV, tables
* All data stored locally (no cloud)

DOCUMENT INGESTION
* Watch folder (e.g., Watchdog) → auto-ingest on file add/modify/delete
* Nested folder structure → auto-tagging
* Supported formats: PDF, scanned PDF, DOCX, XLSX, CSV, JPG/PNG
* Version control on re-upload

QUERY & RETRIEVAL
* Restrict queries to a single client’s documents (no cross-client leakage)
* Structured queries (e.g., “Show invoices > ₹1 lakh”)
* Comparative queries (e.g., “Compare FY23 vs FY24 gross profit”)
* Keyword fallback

HIGHLIGHTING & RENDERING
* Annotated PDF served to frontend
* XLSX → colored cell export
* Jump directly to highlighted page
* Multi-document highlights in one response

ANSWER GENERATION
* Local LLM only
* Every claim cited with doc + page reference

MY QUESTIONS
1. Parsing: I’m considering LlamaIndex LiteParse.
   → Should I store document IDs + chunk IDs for PDFs to enable highlighting?
2. Vector DB:
   * Do I need one (e.g., Qdrant)?
   * If yes, how do I store doc IDs + chunk IDs alongside embeddings for highlighting?
   * Would pgvector in Postgres be sufficient?
3. GraphRAG:
   * How effective are systems like Neo4j or Microsoft GraphRAG?
   * Can they run locally/offline, or are they too computationally heavy?
   * Is this GraphRAG pipeline from LlamaIndex a good starting point?
4. Highlighting UX:
   * I want something like Turnitin/iThenticate reports → exact sentence highlighted + citation.
   * Any open-source projects that already do this?
   * I found Kotaemon and AnythingLLM, which are close but don’t highlight documents.
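To make the "doc IDs + chunk IDs alongside embeddings" question concrete, here is a minimal sketch of the kind of per-chunk record that would support client isolation, page jumps, highlighting, and citations. Everything in it is an assumption for illustration: the `Chunk` fields, the `keyword_search` and `cite` helpers, and the sample rows are hypothetical and not taken from LiteParse, Qdrant, or pgvector. It uses a plain in-memory list and substring matching to stand in for the keyword fallback, not a real vector search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    # All field names are hypothetical; adapt them to whatever the parser emits.
    client_id: str   # tenant key used to prevent cross-client leakage
    doc_id: str      # stable ID for the source file
    version: int     # bumped on re-upload
    chunk_id: int    # position of the chunk within the document
    page: int        # 1-based page number, used for "jump to page"
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1) region to highlight
    text: str


def keyword_search(chunks, client_id, query):
    """Keyword fallback: return this client's chunks that contain every query term."""
    terms = query.lower().split()
    return [
        c for c in chunks
        if c.client_id == client_id and all(t in c.text.lower() for t in terms)
    ]


def cite(chunk):
    """Format a doc + page citation for one chunk."""
    return f"[{chunk.doc_id} v{chunk.version}, p. {chunk.page}]"


# Made-up sample rows, only to exercise the functions above.
chunks = [
    Chunk("client_a", "invoice_001.pdf", 1, 0, 2, (72.0, 540.0, 410.0, 556.0),
          "Gross profit for FY24 was higher than FY23."),
    Chunk("client_b", "report_007.pdf", 1, 0, 5, (72.0, 300.0, 380.0, 316.0),
          "Gross profit for FY24 declined."),
]

hits = keyword_search(chunks, "client_a", "gross profit")
for hit in hits:
    # The frontend would need the page and bbox to draw the highlight.
    print(cite(hit), hit.bbox, hit.text)
```

In a real setup the same fields would presumably live next to the embedding: as the payload of a point in Qdrant, or as ordinary columns beside the vector column in a pgvector table. Filtering on `client_id` before ranking, rather than after, is the design choice that keeps one client's chunks from ever reaching another client's answer.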
TL;DR

Trying to build a local RAG system with:
* Storage + ingestion + tagging
* Query + retrieval + highlighting
* Local LLM answer generation with citations

Looking for advice on:
* Vector DB vs pgvector
* GraphRAG feasibility offline
* Best way to implement document highlighting + citation preview

Would love to hear from anyone who’s built something similar or explored these tools.
