Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreichbhc7g3x4rdxfl3t6bn66mkbbfrzylunh3z5hzt3hzmwyu75xfa",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3moyhb4dy7ls2"
  },
  "path": "/t/help-with-a-local-document-rag-system-storage-ingestion-query-highlighting/176993#post_4",
  "publishedAt": "2026-06-23T21:28:53.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "LlamaIndex structured extraction",
    "LangChain structured output",
    "Ollama structured outputs",
    "vLLM structured outputs",
    "Outlines",
    "Instructor",
    "llama.cpp grammars",
    "pandas read_csv",
    "pandas read_excel",
    "openpyxl",
    "structure-aware chunking for tabular data",
    "(click for more details)"
  ],
  "textContent": "I think structured output often depends less on the model alone and more on how robust the checking/verification layer is:\n\n* * *\n\n## Short version\n\nI would separate the pipeline like this:\n\n\n    Parser extracts structure.\n    LLM extracts semantics.\n    Structured output constrains shape.\n    Validation checks types.\n    Verification checks evidence.\n    SQL stores facts.\n    Embeddings help retrieval.\n    Citations/highlights point back to spans/cells.\n\n\nSo, no, I would not store the typed model primarily as embeddings.\n\nI would store typed extraction results as validated structured records, linked back to the source evidence. Then I would optionally embed either the original source chunk or a short textual summary of the extracted record for semantic recall.\n\nA rough flow:\n\n\n    raw document\n      -> parsed text/tables/cells\n      -> typed extraction\n      -> validation\n      -> evidence verification\n      -> SQL rows\n      -> source refs\n      -> optional embeddings\n\n\nEmbeddings are useful as an index.\nThey should not be the source of truth for structured facts.\n\n* * *\n\n## 1. Parser vs extractor\n\nI would keep these two concepts separate.\n\nLayer | Job\n---|---\nParser | Recover document/table/cell structure\nNormalizer | Turn messy pages/sheets/tables into stable internal records\nExtractor | Produce typed objects like `Invoice` or `FinancialMetric`\nStructured output layer | Make the model response parseable\nValidator | Check schema, types, required fields, ranges\nVerifier | Check source span/cell and deterministic values\nSQL store | Store extracted facts as queryable records\nEmbeddings | Support fuzzy/semantic retrieval\nCitation layer | Point answers back to pages/spans/cells\n\nA parser might tell you:\n\n\n    {\n      \"sheet\": \"FY24\",\n      \"cell\": \"D17\",\n      \"value\": \"1250000\",\n      \"number_format\": \"₹#,##0\",\n      \"row_header\": \"Gross Profit\",\n      \"column_header\": \"FY24\"\n    }\n\n\nAn extractor might turn that into:\n\n\n    {\n      \"metric_name\": \"Gross Profit\",\n      \"period\": \"FY24\",\n      \"value\": 1250000,\n      \"currency\": \"INR\",\n      \"source\": {\n        \"document_id\": \"doc_123\",\n        \"document_version_id\": \"v3\",\n        \"sheet\": \"FY24\",\n        \"cell\": \"D17\"\n      }\n    }\n\n\nThose are different layers.\n\nThe parser preserves structure.\nThe extractor assigns meaning.\n\n* * *\n\n## 2. Where typed models should go\n\nFor typed data, I would use something like Pydantic / JSON Schema / SQL tables.\n\nFor example:\n\n\n    from datetime import date\n    from decimal import Decimal\n    from pydantic import BaseModel\n\n    class Invoice(BaseModel):\n        invoice_number: str | None\n        vendor_name: str | None\n        invoice_date: date | None\n        total_amount: Decimal | None\n        currency: str | None\n\n        source_document_id: str\n        source_version_id: str\n        source_page: int | None = None\n        source_sheet: str | None = None\n        source_cell_range: str | None = None\n        source_span_id: str | None = None\n\n\nThen store it as a structured row.\n\nPossible tables:\n\n\n    invoices\n    invoice_line_items\n    financial_metrics\n    period_values\n    entities\n    source_refs\n\n\nIf you want semantic search over extracted records, you can also create a text summary and embed that:\n\n\n    Invoice INV-2024-018 from Vendor ABC, dated 2024-03-18, total ₹125,000, source: doc_123 page 4.\n\n\nBut the embedding should point back to the structured row, not replace it.\n\nSo I would use both:\n\nObject | Store where? | Why\n---|---|---\nTyped record | SQL / relational DB | source of truth\nSource span/cell | provenance tables | citation/highlight\nOriginal text/table chunk | document store | audit/debug\nEmbedding | vector index / pgvector / Qdrant | recall\nText summary of typed record | optional embedding input | semantic lookup\n\nThe risky design is:\n\n\n    typed fact -> only embedding -> retrieve later\n\n\nThe safer design is:\n\n\n    typed fact -> validated SQL row -> evidence link -> optional embedding\n\n\n* * *\n\n## 3. Structured output helps, but does not prove correctness\n\nLocal models can be used for typed extraction. I would just frame it as an evaluation question rather than a yes/no model-size question.\n\nA better question is:\n\n> Can this local model extract my target fields from my real files, with valid schema, correct values, and source evidence links?\n\nThere are several tools/patterns worth knowing:\n\n  * LlamaIndex structured extraction uses typed/Pydantic-style extraction patterns.\n  * LangChain structured output supports schema-based outputs.\n  * Ollama structured outputs supports JSON Schema-style constrained outputs.\n  * vLLM structured outputs supports JSON Schema / Pydantic-style structured output.\n  * Outlines is useful if you want provider-independent structured generation.\n  * Instructor is useful for Pydantic validation/retry style extraction.\n  * llama.cpp grammars are useful if you are running local constrained decoding.\n\n\n\nBut the important caveat is:\n\n> Structured output can make the output parseable. It does not automatically make the extracted value correct.\n\nFor example, this can be valid JSON and still wrong:\n\n\n    {\n      \"metric_name\": \"Gross Profit\",\n      \"period\": \"FY24\",\n      \"value\": 1900000,\n      \"currency\": \"INR\",\n      \"source_cell\": \"D17\"\n    }\n\n\nThe JSON may be valid.\nThe Pydantic model may validate.\nBut the value may still not match the sheet.\n\nSo I would use a pipeline like:\n\n\n    LLM extraction\n      -> schema validation\n      -> deterministic checks where possible\n      -> evidence/source check\n      -> retry or mark uncertain\n      -> store structured record\n\n\nFor invoices, dates, amounts, currencies, fiscal years, and spreadsheet cells, many checks can be partly deterministic.\n\nExamples:\n\nField | Possible check\n---|---\namount | parse as decimal, compare against source span/cell\ncurrency | normalize symbol/code\ndate | parse date and keep original string\nfiscal year | normalize FY23 / FY2023 / 2023\ncell reference | verify sheet/cell exists\ninvoice total | compare against table total if available\nsource quote | check quote appears in source span\nsource cell | check extracted value matches cell/range\n\nThis is why I would store both the extracted value and its evidence reference.\n\n* * *\n\n## 4. CSV/XLSX: do not start with text chunks\n\nFor CSV/XLSX, I would not begin with normal text chunking.\n\nThink in:\n\n\n    workbook\n      -> sheet\n      -> table / detected region\n      -> row\n      -> column\n      -> cell\n      -> range\n      -> header context\n\n\nCSV is usually closer to a table.\nXLSX can be a table, a report, a financial model, or a visual workbook.\n\nSo I would treat them differently.\n\nFormat | First representation\n---|---\nCSV | DataFrame / SQL table\nSimple XLSX table | DataFrame + sheet/cell provenance\nMessy XLSX report | sheet/region/table/cell model\nFinancial model | workbook graph + formulas + values\nHighlight/export target | workbook/sheet/cell/range provenance\n\nFor table-like data, pandas read_csv and pandas read_excel are natural starting points.\n\nFor cell-level provenance, formulas, styles, and annotated XLSX export, openpyxl is useful.\n\nA practical rule:\n\n> Use DataFrames for table-like regions, but keep workbook/sheet/cell provenance separately.\n\nDo not flatten XLSX into plain text too early if you need citations or highlights.\n\n* * *\n\n## 5. Chunk overlap for spreadsheets\n\nFor normal prose documents, chunk overlap can help because sentences/paragraphs cross chunk boundaries.\n\nFor spreadsheets, overlap is usually not the main question.\n\nA better question is:\n\n> What is the correct table/window/header context for this row or cell?\n\nFor a spreadsheet row, the useful context may be:\n\n  * sheet name\n  * table title\n  * section title\n  * row label\n  * column label\n  * nearby headers\n  * units\n  * currency\n  * fiscal period\n  * notes\n  * formula\n  * displayed value\n\n\n\nA good spreadsheet “chunk” might be:\n\n\n    sheet + detected table + header rows + row group + relevant rows\n\n\nor:\n\n\n    cell/range + row header context + column header context + surrounding table metadata\n\n\nSo instead of arbitrary text overlap, I would use structure-aware windows.\n\nA simple approach:\n\nUnit | Good for\n---|---\nwhole sheet summary | sheet-level routing\ndetected table summary | choosing relevant table\nrow-level record | filtering/search\ncell/range evidence | citation/highlight\nrow window with headers | LLM explanation\nstructured SQL row | numeric query\n\nThere is also research around structure-aware chunking for tabular data, which is relevant because ordinary text chunking does not preserve table structure well.\n\n* * *\n\n## 6. Should CSV/XLSX become generalized typed models?\n\nI would not use one universal typed model for everything.\n\nI would use a few domain-specific typed models plus a generic source reference.\n\nFor example:\n\n\n    Document\n    DocumentVersion\n    SourceRef\n    ParsedTable\n    ParsedCell\n    Invoice\n    InvoiceLineItem\n    FinancialMetric\n    PeriodValue\n    Entity\n\n\n`SourceRef` is the glue.\n\nExample:\n\n\n    {\n      \"source_ref_id\": \"src_8821\",\n      \"document_id\": \"doc_123\",\n      \"document_version_id\": \"v3\",\n      \"page\": null,\n      \"sheet\": \"FY24\",\n      \"cell_range\": \"D17:D18\",\n      \"span_id\": null,\n      \"bbox\": null,\n      \"source_text_hash\": \"...\"\n    }\n\n\nThen an extracted fact can point to `source_ref_id`.\n\nThat gives you:\n\n  * SQL queryability\n  * auditability\n  * citation\n  * highlight/export\n  * reprocessing safety\n\n\n\n* * *\n\n## 7. What to embed\n\nI would embed different things for different purposes.\n\nEmbedding input | Purpose\n---|---\noriginal text chunks | semantic retrieval over prose\ntable summaries | route to relevant table\nrow summaries | semantic lookup over rows\ntyped record summaries | semantic lookup over extracted facts\ndocument summaries | coarse routing\nraw numeric-only cells | usually weak for embeddings\n\nFor example, this is more useful to embed than just `1250000`:\n\n\n    FY24 Gross Profit is ₹1,250,000, extracted from sheet FY24 cell D17 in document doc_123.\n\n\nBut the value should still live in SQL as a numeric value.\n\nThat way:\n\n  * semantic search finds candidate records\n  * SQL does exact filtering/comparison\n  * source refs provide citation/highlight\n\n\n\n* * *\n\n## 8. Example flow: invoice threshold query\n\nFor:\n\n\n    Show invoices > ₹1 lakh\n\n\nI would not rely on embeddings.\n\nI would do:\n\n\n    parse documents\n      -> extract Invoice records\n      -> validate amount/currency/date\n      -> store in SQL\n      -> SQL query total_amount > 100000\n      -> fetch source refs\n      -> render answer with citations/highlights\n\n\nEmbeddings may help if the user asks a fuzzy question like:\n\n\n    Find large vendor bills related to consulting.\n\n\nBut once the candidate domain is identified, the threshold should be SQL.\n\n* * *\n\n## 9. Example flow: FY23 vs FY24 gross profit\n\nFor:\n\n\n    Compare FY23 vs FY24 gross profit\n\n\nI would aim for:\n\n\n    parse spreadsheet/PDF table\n      -> identify relevant table/metric\n      -> extract FinancialMetric records\n      -> normalize fiscal periods\n      -> store values in SQL\n      -> compare numerically\n      -> cite source cells/spans\n\n\nPossible typed model:\n\n\n    from decimal import Decimal\n    from pydantic import BaseModel\n\n    class FinancialMetric(BaseModel):\n        metric_name: str\n        period: str\n        value: Decimal\n        currency: str | None = None\n        unit: str | None = None\n        source_ref_id: str\n\n\nAgain, embeddings can help find “gross profit” semantically, but the comparison itself should be structured.\n\n* * *\n\n## 10. Suggested minimal implementation\n\nIf you want a simple version first:\n\n\n    1. Parse CSV/XLSX into DataFrames where possible.\n    2. Preserve sheet/cell/range provenance.\n    3. Define 2-3 typed models only.\n    4. Use local structured output to extract records.\n    5. Validate records.\n    6. Verify source refs and deterministic values where possible.\n    7. Store records in Postgres.\n    8. Store source refs separately.\n    9. Embed source chunks and optional record summaries.\n    10. Use SQL for structured questions.\n    11. Use embeddings for fuzzy recall.\n\n\nStart small.\n\nMaybe begin with:\n\n\n    Invoice\n    FinancialMetric\n    SourceRef\n\n\nThen expand only when needed.\n\n* * *\n\nA slightly more concrete data model (click for more details) Possible extraction loop (click for more details) Tiny eval set I would build early (click for more details)\n\n## My practical answer\n\nSo for your specific question:\n\n> Should the LLM extract typed models from chunk text and store them in embeddings?\n\nI would answer:\n\n> Let the LLM extract typed models from parsed text/table/cell context, but store those typed models as validated structured records linked to evidence. Use embeddings only as a retrieval index over chunks, table summaries, row summaries, or record summaries.\n\nAnd for CSV/XLSX:\n\n> Do not treat them as prose documents first. Treat them as tables/sheets/cells/ranges first, and only create text representations after preserving that structure.\n\nThat separation should keep the system much easier to debug later.",
  "title": "Help with a Local Document RAG System (Storage + Ingestion + Query + Highlighting)"
}