Help with a Local Document RAG System (Storage + Ingestion + Query + Highlighting)
I think structured output often depends less on the model alone and more on how robust the checking/verification layer is:
Short version
I would separate the pipeline like this:
Parser extracts structure.
LLM extracts semantics.
Structured output constrains shape.
Validation checks types.
Verification checks evidence.
SQL stores facts.
Embeddings help retrieval.
Citations/highlights point back to spans/cells.
So, no, I would not store the typed model primarily as embeddings.
I would store typed extraction results as validated structured records, linked back to the source evidence. Then I would optionally embed either the original source chunk or a short textual summary of the extracted record for semantic recall.
A rough flow:
raw document
-> parsed text/tables/cells
-> typed extraction
-> validation
-> evidence verification
-> SQL rows
-> source refs
-> optional embeddings
Embeddings are useful as an index. They should not be the source of truth for structured facts.
1. Parser vs extractor
I would keep these two concepts separate.
| Layer | Job |
|---|---|
| Parser | Recover document/table/cell structure |
| Normalizer | Turn messy pages/sheets/tables into stable internal records |
| Extractor | Produce typed objects like Invoice or FinancialMetric |
| Structured output layer | Make the model response parseable |
| Validator | Check schema, types, required fields, ranges |
| Verifier | Check source span/cell and deterministic values |
| SQL store | Store extracted facts as queryable records |
| Embeddings | Support fuzzy/semantic retrieval |
| Citation layer | Point answers back to pages/spans/cells |
A parser might tell you:
{
"sheet": "FY24",
"cell": "D17",
"value": "1250000",
"number_format": "₹#,##0",
"row_header": "Gross Profit",
"column_header": "FY24"
}
An extractor might turn that into:
{
"metric_name": "Gross Profit",
"period": "FY24",
"value": 1250000,
"currency": "INR",
"source": {
"document_id": "doc_123",
"document_version_id": "v3",
"sheet": "FY24",
"cell": "D17"
}
}
Those are different layers.
The parser preserves structure. The extractor assigns meaning.
2. Where typed models should go
For typed data, I would use something like Pydantic / JSON Schema / SQL tables.
For example:
from datetime import date
from decimal import Decimal
from pydantic import BaseModel
class Invoice(BaseModel):
invoice_number: str | None
vendor_name: str | None
invoice_date: date | None
total_amount: Decimal | None
currency: str | None
source_document_id: str
source_version_id: str
source_page: int | None = None
source_sheet: str | None = None
source_cell_range: str | None = None
source_span_id: str | None = None
Then store it as a structured row.
Possible tables:
invoices
invoice_line_items
financial_metrics
period_values
entities
source_refs
If you want semantic search over extracted records, you can also create a text summary and embed that:
Invoice INV-2024-018 from Vendor ABC, dated 2024-03-18, total ₹125,000, source: doc_123 page 4.
But the embedding should point back to the structured row, not replace it.
So I would use both:
| Object | Store where? | Why |
|---|---|---|
| Typed record | SQL / relational DB | source of truth |
| Source span/cell | provenance tables | citation/highlight |
| Original text/table chunk | document store | audit/debug |
| Embedding | vector index / pgvector / Qdrant | recall |
| Text summary of typed record | optional embedding input | semantic lookup |
The risky design is:
typed fact -> only embedding -> retrieve later
The safer design is:
typed fact -> validated SQL row -> evidence link -> optional embedding
3. Structured output helps, but does not prove correctness
Local models can be used for typed extraction. I would just frame it as an evaluation question rather than a yes/no model-size question.
A better question is:
Can this local model extract my target fields from my real files, with valid schema, correct values, and source evidence links?
There are several tools/patterns worth knowing:
- LlamaIndex structured extraction uses typed/Pydantic-style extraction patterns.
- LangChain structured output supports schema-based outputs.
- Ollama structured outputs supports JSON Schema-style constrained outputs.
- vLLM structured outputs supports JSON Schema / Pydantic-style structured output.
- Outlines is useful if you want provider-independent structured generation.
- Instructor is useful for Pydantic validation/retry style extraction.
- llama.cpp grammars are useful if you are running local constrained decoding.
But the important caveat is:
Structured output can make the output parseable. It does not automatically make the extracted value correct.
For example, this can be valid JSON and still wrong:
{
"metric_name": "Gross Profit",
"period": "FY24",
"value": 1900000,
"currency": "INR",
"source_cell": "D17"
}
The JSON may be valid. The Pydantic model may validate. But the value may still not match the sheet.
So I would use a pipeline like:
LLM extraction
-> schema validation
-> deterministic checks where possible
-> evidence/source check
-> retry or mark uncertain
-> store structured record
For invoices, dates, amounts, currencies, fiscal years, and spreadsheet cells, many checks can be partly deterministic.
Examples:
| Field | Possible check |
|---|---|
| amount | parse as decimal, compare against source span/cell |
| currency | normalize symbol/code |
| date | parse date and keep original string |
| fiscal year | normalize FY23 / FY2023 / 2023 |
| cell reference | verify sheet/cell exists |
| invoice total | compare against table total if available |
| source quote | check quote appears in source span |
| source cell | check extracted value matches cell/range |
This is why I would store both the extracted value and its evidence reference.
4. CSV/XLSX: do not start with text chunks
For CSV/XLSX, I would not begin with normal text chunking.
Think in:
workbook
-> sheet
-> table / detected region
-> row
-> column
-> cell
-> range
-> header context
CSV is usually closer to a table. XLSX can be a table, a report, a financial model, or a visual workbook.
So I would treat them differently.
| Format | First representation |
|---|---|
| CSV | DataFrame / SQL table |
| Simple XLSX table | DataFrame + sheet/cell provenance |
| Messy XLSX report | sheet/region/table/cell model |
| Financial model | workbook graph + formulas + values |
| Highlight/export target | workbook/sheet/cell/range provenance |
For table-like data, pandas read_csv and pandas read_excel are natural starting points.
For cell-level provenance, formulas, styles, and annotated XLSX export, openpyxl is useful.
A practical rule:
Use DataFrames for table-like regions, but keep workbook/sheet/cell provenance separately.
Do not flatten XLSX into plain text too early if you need citations or highlights.
5. Chunk overlap for spreadsheets
For normal prose documents, chunk overlap can help because sentences/paragraphs cross chunk boundaries.
For spreadsheets, overlap is usually not the main question.
A better question is:
What is the correct table/window/header context for this row or cell?
For a spreadsheet row, the useful context may be:
- sheet name
- table title
- section title
- row label
- column label
- nearby headers
- units
- currency
- fiscal period
- notes
- formula
- displayed value
A good spreadsheet “chunk” might be:
sheet + detected table + header rows + row group + relevant rows
or:
cell/range + row header context + column header context + surrounding table metadata
So instead of arbitrary text overlap, I would use structure-aware windows.
A simple approach:
| Unit | Good for |
|---|---|
| whole sheet summary | sheet-level routing |
| detected table summary | choosing relevant table |
| row-level record | filtering/search |
| cell/range evidence | citation/highlight |
| row window with headers | LLM explanation |
| structured SQL row | numeric query |
There is also research around structure-aware chunking for tabular data, which is relevant because ordinary text chunking does not preserve table structure well.
6. Should CSV/XLSX become generalized typed models?
I would not use one universal typed model for everything.
I would use a few domain-specific typed models plus a generic source reference.
For example:
Document
DocumentVersion
SourceRef
ParsedTable
ParsedCell
Invoice
InvoiceLineItem
FinancialMetric
PeriodValue
Entity
SourceRef is the glue.
Example:
{
"source_ref_id": "src_8821",
"document_id": "doc_123",
"document_version_id": "v3",
"page": null,
"sheet": "FY24",
"cell_range": "D17:D18",
"span_id": null,
"bbox": null,
"source_text_hash": "..."
}
Then an extracted fact can point to source_ref_id.
That gives you:
- SQL queryability
- auditability
- citation
- highlight/export
- reprocessing safety
7. What to embed
I would embed different things for different purposes.
| Embedding input | Purpose |
|---|---|
| original text chunks | semantic retrieval over prose |
| table summaries | route to relevant table |
| row summaries | semantic lookup over rows |
| typed record summaries | semantic lookup over extracted facts |
| document summaries | coarse routing |
| raw numeric-only cells | usually weak for embeddings |
For example, this is more useful to embed than just 1250000:
FY24 Gross Profit is ₹1,250,000, extracted from sheet FY24 cell D17 in document doc_123.
But the value should still live in SQL as a numeric value.
That way:
- semantic search finds candidate records
- SQL does exact filtering/comparison
- source refs provide citation/highlight
8. Example flow: invoice threshold query
For:
Show invoices > ₹1 lakh
I would not rely on embeddings.
I would do:
parse documents
-> extract Invoice records
-> validate amount/currency/date
-> store in SQL
-> SQL query total_amount > 100000
-> fetch source refs
-> render answer with citations/highlights
Embeddings may help if the user asks a fuzzy question like:
Find large vendor bills related to consulting.
But once the candidate domain is identified, the threshold should be SQL.
9. Example flow: FY23 vs FY24 gross profit
For:
Compare FY23 vs FY24 gross profit
I would aim for:
parse spreadsheet/PDF table
-> identify relevant table/metric
-> extract FinancialMetric records
-> normalize fiscal periods
-> store values in SQL
-> compare numerically
-> cite source cells/spans
Possible typed model:
from decimal import Decimal
from pydantic import BaseModel
class FinancialMetric(BaseModel):
metric_name: str
period: str
value: Decimal
currency: str | None = None
unit: str | None = None
source_ref_id: str
Again, embeddings can help find “gross profit” semantically, but the comparison itself should be structured.
10. Suggested minimal implementation
If you want a simple version first:
1. Parse CSV/XLSX into DataFrames where possible.
2. Preserve sheet/cell/range provenance.
3. Define 2-3 typed models only.
4. Use local structured output to extract records.
5. Validate records.
6. Verify source refs and deterministic values where possible.
7. Store records in Postgres.
8. Store source refs separately.
9. Embed source chunks and optional record summaries.
10. Use SQL for structured questions.
11. Use embeddings for fuzzy recall.
Start small.
Maybe begin with:
Invoice
FinancialMetric
SourceRef
Then expand only when needed.
A slightly more concrete data model (click for more details) Possible extraction loop (click for more details) Tiny eval set I would build early (click for more details)
My practical answer
So for your specific question:
Should the LLM extract typed models from chunk text and store them in embeddings?
I would answer:
Let the LLM extract typed models from parsed text/table/cell context, but store those typed models as validated structured records linked to evidence. Use embeddings only as a retrieval index over chunks, table summaries, row summaries, or record summaries.
And for CSV/XLSX:
Do not treat them as prose documents first. Treat them as tables/sheets/cells/ranges first, and only create text representations after preserving that structure.
That separation should keep the system much easier to debug later.
Discussion in the ATmosphere