External Publication

Help with a Local Document RAG System (Storage + Ingestion + Query + Highlighting)

Hugging Face Forums [Unofficial] June 23, 2026

I think structured output often depends less on the model alone and more on how robust the checking/verification layer is:

Short version

I would separate the pipeline like this:

Parser extracts structure.
LLM extracts semantics.
Structured output constrains shape.
Validation checks types.
Verification checks evidence.
SQL stores facts.
Embeddings help retrieval.
Citations/highlights point back to spans/cells.

So, no, I would not store the typed model primarily as embeddings.

I would store typed extraction results as validated structured records, linked back to the source evidence. Then I would optionally embed either the original source chunk or a short textual summary of the extracted record for semantic recall.

A rough flow:

raw document
  -> parsed text/tables/cells
  -> typed extraction
  -> validation
  -> evidence verification
  -> SQL rows
  -> source refs
  -> optional embeddings

Embeddings are useful as an index. They should not be the source of truth for structured facts.

1. Parser vs extractor

I would keep these two concepts separate.

Layer	Job
Parser	Recover document/table/cell structure
Normalizer	Turn messy pages/sheets/tables into stable internal records
Extractor	Produce typed objects like `Invoice` or `FinancialMetric`
Structured output layer	Make the model response parseable
Validator	Check schema, types, required fields, ranges
Verifier	Check source span/cell and deterministic values
SQL store	Store extracted facts as queryable records
Embeddings	Support fuzzy/semantic retrieval
Citation layer	Point answers back to pages/spans/cells

A parser might tell you:

{
  "sheet": "FY24",
  "cell": "D17",
  "value": "1250000",
  "number_format": "₹#,##0",
  "row_header": "Gross Profit",
  "column_header": "FY24"
}

An extractor might turn that into:

{
  "metric_name": "Gross Profit",
  "period": "FY24",
  "value": 1250000,
  "currency": "INR",
  "source": {
    "document_id": "doc_123",
    "document_version_id": "v3",
    "sheet": "FY24",
    "cell": "D17"
  }
}

Those are different layers.

The parser preserves structure. The extractor assigns meaning.

2. Where typed models should go

For typed data, I would use something like Pydantic / JSON Schema / SQL tables.

For example:

from datetime import date
from decimal import Decimal
from pydantic import BaseModel

class Invoice(BaseModel):
    invoice_number: str | None
    vendor_name: str | None
    invoice_date: date | None
    total_amount: Decimal | None
    currency: str | None

    source_document_id: str
    source_version_id: str
    source_page: int | None = None
    source_sheet: str | None = None
    source_cell_range: str | None = None
    source_span_id: str | None = None

Then store it as a structured row.

Possible tables:

invoices
invoice_line_items
financial_metrics
period_values
entities
source_refs

If you want semantic search over extracted records, you can also create a text summary and embed that:

Invoice INV-2024-018 from Vendor ABC, dated 2024-03-18, total ₹125,000, source: doc_123 page 4.

But the embedding should point back to the structured row, not replace it.

So I would use both:

Object	Store where?	Why
Typed record	SQL / relational DB	source of truth
Source span/cell	provenance tables	citation/highlight
Original text/table chunk	document store	audit/debug
Embedding	vector index / pgvector / Qdrant	recall
Text summary of typed record	optional embedding input	semantic lookup

The risky design is:

typed fact -> only embedding -> retrieve later

The safer design is:

typed fact -> validated SQL row -> evidence link -> optional embedding

3. Structured output helps, but does not prove correctness

Local models can be used for typed extraction. I would just frame it as an evaluation question rather than a yes/no model-size question.

A better question is:

Can this local model extract my target fields from my real files, with valid schema, correct values, and source evidence links?

There are several tools/patterns worth knowing:

LlamaIndex structured extraction uses typed/Pydantic-style extraction patterns.
LangChain structured output supports schema-based outputs.
Ollama structured outputs supports JSON Schema-style constrained outputs.
vLLM structured outputs supports JSON Schema / Pydantic-style structured output.
Outlines is useful if you want provider-independent structured generation.
Instructor is useful for Pydantic validation/retry style extraction.
llama.cpp grammars are useful if you are running local constrained decoding.

But the important caveat is:

Structured output can make the output parseable. It does not automatically make the extracted value correct.

For example, this can be valid JSON and still wrong:

{
  "metric_name": "Gross Profit",
  "period": "FY24",
  "value": 1900000,
  "currency": "INR",
  "source_cell": "D17"
}

The JSON may be valid. The Pydantic model may validate. But the value may still not match the sheet.

So I would use a pipeline like:

LLM extraction
  -> schema validation
  -> deterministic checks where possible
  -> evidence/source check
  -> retry or mark uncertain
  -> store structured record

For invoices, dates, amounts, currencies, fiscal years, and spreadsheet cells, many checks can be partly deterministic.

Examples:

Field	Possible check
amount	parse as decimal, compare against source span/cell
currency	normalize symbol/code
date	parse date and keep original string
fiscal year	normalize FY23 / FY2023 / 2023
cell reference	verify sheet/cell exists
invoice total	compare against table total if available
source quote	check quote appears in source span
source cell	check extracted value matches cell/range

This is why I would store both the extracted value and its evidence reference.

4. CSV/XLSX: do not start with text chunks

For CSV/XLSX, I would not begin with normal text chunking.

Think in:

workbook
  -> sheet
  -> table / detected region
  -> row
  -> column
  -> cell
  -> range
  -> header context

CSV is usually closer to a table. XLSX can be a table, a report, a financial model, or a visual workbook.

So I would treat them differently.

Format	First representation
CSV	DataFrame / SQL table
Simple XLSX table	DataFrame + sheet/cell provenance
Messy XLSX report	sheet/region/table/cell model
Financial model	workbook graph + formulas + values
Highlight/export target	workbook/sheet/cell/range provenance

For table-like data, pandas read_csv and pandas read_excel are natural starting points.

For cell-level provenance, formulas, styles, and annotated XLSX export, openpyxl is useful.

A practical rule:

Use DataFrames for table-like regions, but keep workbook/sheet/cell provenance separately.

Do not flatten XLSX into plain text too early if you need citations or highlights.

5. Chunk overlap for spreadsheets

For normal prose documents, chunk overlap can help because sentences/paragraphs cross chunk boundaries.

For spreadsheets, overlap is usually not the main question.

A better question is:

What is the correct table/window/header context for this row or cell?

For a spreadsheet row, the useful context may be:

sheet name
table title
section title
row label
column label
nearby headers
units
currency
fiscal period
notes
formula
displayed value

A good spreadsheet “chunk” might be:

sheet + detected table + header rows + row group + relevant rows

or:

cell/range + row header context + column header context + surrounding table metadata

So instead of arbitrary text overlap, I would use structure-aware windows.

A simple approach:

Unit	Good for
whole sheet summary	sheet-level routing
detected table summary	choosing relevant table
row-level record	filtering/search
cell/range evidence	citation/highlight
row window with headers	LLM explanation
structured SQL row	numeric query

There is also research around structure-aware chunking for tabular data, which is relevant because ordinary text chunking does not preserve table structure well.

6. Should CSV/XLSX become generalized typed models?

I would not use one universal typed model for everything.

I would use a few domain-specific typed models plus a generic source reference.

For example:

Document
DocumentVersion
SourceRef
ParsedTable
ParsedCell
Invoice
InvoiceLineItem
FinancialMetric
PeriodValue
Entity

SourceRef is the glue.

Example:

{
  "source_ref_id": "src_8821",
  "document_id": "doc_123",
  "document_version_id": "v3",
  "page": null,
  "sheet": "FY24",
  "cell_range": "D17:D18",
  "span_id": null,
  "bbox": null,
  "source_text_hash": "..."
}

Then an extracted fact can point to source_ref_id.

That gives you:

SQL queryability
auditability
citation
highlight/export
reprocessing safety

7. What to embed

I would embed different things for different purposes.

Embedding input	Purpose
original text chunks	semantic retrieval over prose
table summaries	route to relevant table
row summaries	semantic lookup over rows
typed record summaries	semantic lookup over extracted facts
document summaries	coarse routing
raw numeric-only cells	usually weak for embeddings

For example, this is more useful to embed than just 1250000:

FY24 Gross Profit is ₹1,250,000, extracted from sheet FY24 cell D17 in document doc_123.

But the value should still live in SQL as a numeric value.

That way:

semantic search finds candidate records
SQL does exact filtering/comparison
source refs provide citation/highlight

8. Example flow: invoice threshold query

For:

Show invoices > ₹1 lakh

I would not rely on embeddings.

I would do:

parse documents
  -> extract Invoice records
  -> validate amount/currency/date
  -> store in SQL
  -> SQL query total_amount > 100000
  -> fetch source refs
  -> render answer with citations/highlights

Embeddings may help if the user asks a fuzzy question like:

Find large vendor bills related to consulting.

But once the candidate domain is identified, the threshold should be SQL.

9. Example flow: FY23 vs FY24 gross profit

For:

Compare FY23 vs FY24 gross profit

I would aim for:

parse spreadsheet/PDF table
  -> identify relevant table/metric
  -> extract FinancialMetric records
  -> normalize fiscal periods
  -> store values in SQL
  -> compare numerically
  -> cite source cells/spans

Possible typed model:

from decimal import Decimal
from pydantic import BaseModel

class FinancialMetric(BaseModel):
    metric_name: str
    period: str
    value: Decimal
    currency: str | None = None
    unit: str | None = None
    source_ref_id: str

Again, embeddings can help find “gross profit” semantically, but the comparison itself should be structured.

10. Suggested minimal implementation

If you want a simple version first:

1. Parse CSV/XLSX into DataFrames where possible.
2. Preserve sheet/cell/range provenance.
3. Define 2-3 typed models only.
4. Use local structured output to extract records.
5. Validate records.
6. Verify source refs and deterministic values where possible.
7. Store records in Postgres.
8. Store source refs separately.
9. Embed source chunks and optional record summaries.
10. Use SQL for structured questions.
11. Use embeddings for fuzzy recall.

Start small.

Maybe begin with:

Invoice
FinancialMetric
SourceRef

Then expand only when needed.

A slightly more concrete data model (click for more details) Possible extraction loop (click for more details) Tiny eval set I would build early (click for more details)

My practical answer

So for your specific question:

Should the LLM extract typed models from chunk text and store them in embeddings?

I would answer:

Let the LLM extract typed models from parsed text/table/cell context, but store those typed models as validated structured records linked to evidence. Use embeddings only as a retrieval index over chunks, table summaries, row summaries, or record summaries.

And for CSV/XLSX:

Do not treat them as prose documents first. Treat them as tables/sheets/cells/ranges first, and only create text representations after preserving that structure.

That separation should keep the system much easier to debug later.