{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreiag7jtwjrc3q2f3qflpympzi52iifwdvyknnrufspswduewn4o474",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mopniffwjl62"
},
"path": "/t/help-with-a-local-document-rag-system-storage-ingestion-query-highlighting/176993#post_1",
"publishedAt": "2026-06-20T08:44:01.000Z",
"site": "https://discuss.huggingface.co",
"textContent": "Hey folks,\n\nI’m working on designing a local, offline document retrieval + LLM pipeline and would love your input on the architecture. Here’s what I’m aiming for:\n\nSTORAGE\n\n * Upload PDF, DOCX, XLSX, CSV, tables\n * All data stored locally (no cloud)\n\n\n\nDOCUMENT INGESTION\n\n * Watch folder (e.g., Watchdog) → auto-ingest on file add/modify/delete\n * Nested folder structure → auto-tagging\n * Supported formats: PDF, scanned PDF, DOCX, XLSX, CSV, JPG/PNG\n * Version control on re-upload\n\n\n\nQUERY & RETRIEVAL\n\n * Restrict queries to a single client’s documents (no cross-client leakage)\n * Structured queries (e.g., “Show invoices > ₹1 lakh”)\n * Comparative queries (e.g., “Compare FY23 vs FY24 gross profit”)\n * Keyword fallback\n\n\n\nHIGHLIGHTING & RENDERING\n\n * Annotated PDF served to frontend\n * XLSX → colored cell export\n * Jump directly to highlighted page\n * Multi-document highlights in one response\n\n\n\nANSWER GENERATION\n\n * Local LLM only\n * Every claim cited with doc + page reference\n\n\n\nMY QUESTIONS\n\n 1. Parsing: I’m considering LlamaIndex LiteParse.\n→ Should I store document IDs + chunk IDs for PDFs to enable highlighting?\n\n 2. Vector DB:\n\n * Do I need one (e.g., Qdrant)?\n * If yes, how do I store doc IDs + chunk IDs alongside embeddings for highlighting?\n * Would pgvector in Postgres be sufficient?\n 3. GraphRAG:\n\n * How effective are systems like Neo4j or Microsoft GraphRAG?\n * Can they run locally/offline, or are they too computationally heavy?\n * Is this GraphRAG pipeline from LlamaIndex a good starting point?\n 4. Highlighting UX:\n\n * I want something like Turnitin/iThenticate reports → exact sentence highlighted + citation.\n * Any open-source projects that already do this?\n * I found Kotaemon and AnythingLLM, which are close but don’t highlight documents.\n\n\n\nTL;DR\nTrying to build a local RAG system with:\n\n * Storage + ingestion + tagging\n * Query + retrieval + highlighting\n * Local LLM answer generation with citations\n\n\n\nLooking for advice on:\n\n * Vector DB vs pgvector\n * GraphRAG feasibility offline\n * Best way to implement document highlighting + citation preview\n\n\n\nWould love to hear from anyone who’s built something similar or explored these tools.",
"title": "Help with a Local Document RAG System (Storage + Ingestion + Query + Highlighting)"
}