Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiekq2w4rhhmiykghtvniarx4pn4lzplz4jsxl5aa4oa7mmjz2c2wa",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mord2wqzln62"
  },
  "path": "/t/help-with-a-local-document-rag-system-storage-ingestion-query-highlighting/176993#post_2",
  "publishedAt": "2026-06-21T01:12:42.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "LiteParse visual citations",
    "pgvector",
    "Qdrant",
    "Microsoft GraphRAG",
    "Neo4j GraphRAG",
    "PDF.js",
    "PyMuPDF annotations",
    "openpyxl",
    "Kotaemon",
    "RAGFlow DeepDoc"
  ],
  "textContent": "Hmm… with PDFs and XLSX files, a lot of RAG projects get stuck before the actual RAG part starts. Anyway, as a general architecture, I would think about it like this:\n\n* * *\n\n## Direct answers\n\n### 1. Should you store document IDs and chunk IDs?\n\nYes, but I would treat `document_id` and `chunk_id` as only the minimum.\n\nFor highlighting, you probably want a more provenance-heavy model:\n\n  * `document_id`\n  * `document_version_id`\n  * `chunk_id`\n  * `span_id`\n  * page number\n  * character offsets\n  * bounding boxes\n  * parser element IDs\n  * table IDs\n  * sheet names\n  * cell ranges\n  * source text hash\n\n\n\nA chunk is usually a retrieval unit.\nA highlight needs an evidence unit.\n\nThat distinction matters a lot. A chunk may contain the answer, but it may still be too coarse to highlight the exact sentence, paragraph, table cell, or spreadsheet range.\n\nFor PDF visual citations, LiteParse visual citations are worth looking at because they focus on page screenshots and bounding boxes. I would not treat that as “just citation formatting”; it is closer to a document-layout/provenance problem.\n\n* * *\n\n### 2. Do you need a vector database?\n\nFor a local MVP, I would probably start with:\n\n  * Postgres\n  * pgvector\n  * Postgres full-text search\n  * structured extraction tables\n  * optional reranking\n\n\n\nThe reason is that your system is not only semantic retrieval. You also need:\n\n  * client isolation\n  * versioning\n  * file metadata\n  * structured invoice queries\n  * fiscal-period comparisons\n  * extracted table data\n  * audit trails\n  * citation metadata\n  * highlight metadata\n\n\n\nThose are relational/database problems before they are vector-search problems.\n\nSo I would not make the vector DB the source of truth. I would make the document/version/provenance database the source of truth, and then attach vector search to that.\n\n* * *\n\n### 3. 
Would pgvector be sufficient?\n\nProbably yes for the first serious local version.\n\n`pgvector` is a good default if you want a single local system where SQL filters, metadata, document versions, and embeddings live close together.\n\nI would move to Qdrant when vector search becomes its own subsystem:\n\n  * larger collections\n  * heavier vector-search tuning\n  * more demanding payload filtering\n  * separate scaling\n  * separate retrieval infrastructure\n  * need for a dedicated vector-search service\n\n\n\nQdrant is a very reasonable choice, especially for vector search with payload filters. I just would not start there unless you already know Postgres will be the bottleneck.\n\nA simple rule:\n\n> Start with Postgres + pgvector if your hardest problems are metadata, versions, permissions, structured queries, and local simplicity. Add Qdrant when vector search itself becomes the hard part.\n\n* * *\n\n### 4. How should doc IDs and chunk IDs be stored with embeddings?\n\nWhichever vector store you use, store the embedding together with enough metadata to get back to the original evidence.\n\nExample payload / metadata:\n\n\n    {\n      \"client_id\": \"client_a\",\n      \"document_id\": \"doc_123\",\n      \"document_version_id\": \"doc_123_v4\",\n      \"chunk_id\": \"chunk_0037\",\n      \"page_start\": 12,\n      \"page_end\": 13,\n      \"span_ids\": [\"span_8801\", \"span_8802\"],\n      \"parser\": \"docling_or_liteparse\",\n      \"source_hash\": \"...\"\n    }\n\n\nBut I would not try to put all visual-highlight data directly inside the vector DB.\n\nBetter split:\n\n  * vector DB / pgvector row: retrieval metadata\n  * relational tables: full provenance, spans, bboxes, cells, tables, versions\n  * file store: original raw documents\n\n\n\nThe retrieval result should point back to the canonical evidence records.\n\n* * *\n\n### 5. 
What about GraphRAG?\n\nGraphRAG can be useful, but I would not make it the first layer.\n\nI would use GraphRAG later for questions like:\n\n  * “Which companies, people, contracts, and obligations are connected?”\n  * “What are the recurring themes across this corpus?”\n  * “Which documents refer to the same entity or obligation?”\n  * “Summarize relationships across many documents.”\n\n\n\nI would not use it as the first tool for:\n\n  * “Show invoices > ₹1 lakh.”\n  * “Compare FY23 vs FY24 gross profit.”\n  * “Search only this client’s documents.”\n  * “Find this spreadsheet value.”\n\n\n\nThose are structured extraction + SQL problems first.\n\nMicrosoft GraphRAG and Neo4j GraphRAG are good references, but I would add that layer only after the simpler system shows what relation/global-summary questions it cannot answer.\n\n* * *\n\n### 6. How should highlighting work?\n\nI would separate citation from visual highlighting.\n\nA citation can be:\n\n  * document title\n  * page number\n  * chunk ID\n  * quote\n  * source link\n\n\n\nA visual highlight needs more:\n\n  * page coordinates\n  * character offsets\n  * bounding boxes\n  * table cell coordinates\n  * spreadsheet ranges\n  * rendering/export logic\n\n\n\nFor PDF, you probably want one or both of:\n\n  * PDF viewer overlay, for example with PDF.js\n  * annotated PDF export, for example with PyMuPDF annotations\n\n\n\nFor XLSX, the evidence unit is usually not a bounding box. It is more like:\n\n\n    workbook_version_id\n    sheet_name\n    cell_address\n    cell_range\n    table_id\n    row_header_context\n    column_header_context\n    displayed_value\n    formula\n\n\nThen you can generate a colored spreadsheet copy with a library such as openpyxl.\n\n* * *\n\n### 7. 
Why existing tools may feel close but not enough\n\nMany local RAG apps optimize for “chat with files.”\n\nYour target sounds closer to “audit-grade document evidence.”\n\nThat is a different problem.\n\nExisting tools often cite a chunk, a page, or a source node. You want to jump to the exact sentence, bounding box, table cell, or spreadsheet range. That is a stricter requirement.\n\nThis is probably why tools like Kotaemon / AnythingLLM / RAGFlow can feel close but still not fully match your UX target. I would study them as design references, but I would expect some custom provenance and rendering work.\n\nKotaemon is especially worth studying for citation/document-preview patterns. RAGFlow DeepDoc is worth studying for document parsing/OCR/layout recognition ideas. I would still design your own canonical provenance layer if you need strict PDF/XLSX highlighting and client isolation.",
  "title": "Help with a Local Document RAG System (Storage + Ingestion + Query + Highlighting)"
}