Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiag7jtwjrc3q2f3qflpympzi52iifwdvyknnrufspswduewn4o474",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mopniffwjl62"
  },
  "path": "/t/help-with-a-local-document-rag-system-storage-ingestion-query-highlighting/176993#post_1",
  "publishedAt": "2026-06-20T08:44:01.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "Hey folks,\n\nI’m working on designing a local, offline document retrieval + LLM pipeline and would love your input on the architecture. Here’s what I’m aiming for:\n\nSTORAGE\n\n  * Upload PDF, DOCX, XLSX, CSV, tables\n  * All data stored locally (no cloud)\n\n\n\nDOCUMENT INGESTION\n\n  * Watch folder (e.g., Watchdog) → auto-ingest on file add/modify/delete\n  * Nested folder structure → auto-tagging\n  * Supported formats: PDF, scanned PDF, DOCX, XLSX, CSV, JPG/PNG\n  * Version control on re-upload\n\n\n\nQUERY & RETRIEVAL\n\n  * Restrict queries to a single client’s documents (no cross-client leakage)\n  * Structured queries (e.g., “Show invoices > ₹1 lakh”)\n  * Comparative queries (e.g., “Compare FY23 vs FY24 gross profit”)\n  * Keyword fallback\n\n\n\nHIGHLIGHTING & RENDERING\n\n  * Annotated PDF served to frontend\n  * XLSX → colored cell export\n  * Jump directly to highlighted page\n  * Multi-document highlights in one response\n\n\n\nANSWER GENERATION\n\n  * Local LLM only\n  * Every claim cited with doc + page reference\n\n\n\nMY QUESTIONS\n\n  1. Parsing: I’m considering LlamaIndex LiteParse.\n→ Should I store document IDs + chunk IDs for PDFs to enable highlighting?\n\n  2. Vector DB:\n\n     * Do I need one (e.g., Qdrant)?\n     * If yes, how do I store doc IDs + chunk IDs alongside embeddings for highlighting?\n     * Would pgvector in Postgres be sufficient?\n  3. GraphRAGs:\n\n     * How effective are systems like Neo4j or Microsoft GraphRAG?\n     * Can they run locally/offline, or are they too computationally heavy?\n     * Is this GraphRAG pipeline from LlamaIndex a good starting point?\n  4. Highlighting UX:\n\n     * I want something like Turnitin/iThenticate reports → exact sentence highlighted + citation.\n     * Any open-source projects that already do this?\n     * I found Kotaemon and AnythingLLM, which are close but don’t highlight documents.\n\n\n\nTL;DR\nTrying to build a local RAG system with:\n\n  * Storage + ingestion + tagging\n  * Query + retrieval + highlighting\n  * Local LLM answer generation with citations\n\n\n\nLooking for advice on:\n\n  * Vector DB vs pgvector\n  * GraphRAG feasibility offline\n  * Best way to implement document highlighting + citation preview\n\n\n\nWould love to hear from anyone who’s built something similar or explored these tools.",
  "title": "Help with a Local Document RAG System (Storage + Ingestion + Query + Highlighting)"
}