External Publication
Visit Post

Help with a Local Document RAG System (Storage + Ingestion + Query + Highlighting)

Hugging Face Forums [Unofficial] June 23, 2026
Source

I think structured output often depends less on the model alone and more on how robust the checking/verification layer is:


Short version

I would separate the pipeline like this:

Parser extracts structure.
LLM extracts semantics.
Structured output constrains shape.
Validation checks types.
Verification checks evidence.
SQL stores facts.
Embeddings help retrieval.
Citations/highlights point back to spans/cells.

So, no, I would not store the typed model primarily as embeddings.

I would store typed extraction results as validated structured records, linked back to the source evidence. Then I would optionally embed either the original source chunk or a short textual summary of the extracted record for semantic recall.

A rough flow:

raw document
  -> parsed text/tables/cells
  -> typed extraction
  -> validation
  -> evidence verification
  -> SQL rows
  -> source refs
  -> optional embeddings

Embeddings are useful as an index. They should not be the source of truth for structured facts.


1. Parser vs extractor

I would keep these two concepts separate.

Layer Job
Parser Recover document/table/cell structure
Normalizer Turn messy pages/sheets/tables into stable internal records
Extractor Produce typed objects like Invoice or FinancialMetric
Structured output layer Make the model response parseable
Validator Check schema, types, required fields, ranges
Verifier Check source span/cell and deterministic values
SQL store Store extracted facts as queryable records
Embeddings Support fuzzy/semantic retrieval
Citation layer Point answers back to pages/spans/cells

A parser might tell you:

{
  "sheet": "FY24",
  "cell": "D17",
  "value": "1250000",
  "number_format": "₹#,##0",
  "row_header": "Gross Profit",
  "column_header": "FY24"
}

An extractor might turn that into:

{
  "metric_name": "Gross Profit",
  "period": "FY24",
  "value": 1250000,
  "currency": "INR",
  "source": {
    "document_id": "doc_123",
    "document_version_id": "v3",
    "sheet": "FY24",
    "cell": "D17"
  }
}

Those are different layers.

The parser preserves structure. The extractor assigns meaning.


2. Where typed models should go

For typed data, I would use something like Pydantic / JSON Schema / SQL tables.

For example:

from datetime import date
from decimal import Decimal
from pydantic import BaseModel

class Invoice(BaseModel):
    invoice_number: str | None
    vendor_name: str | None
    invoice_date: date | None
    total_amount: Decimal | None
    currency: str | None

    source_document_id: str
    source_version_id: str
    source_page: int | None = None
    source_sheet: str | None = None
    source_cell_range: str | None = None
    source_span_id: str | None = None

Then store it as a structured row.

Possible tables:

invoices
invoice_line_items
financial_metrics
period_values
entities
source_refs

If you want semantic search over extracted records, you can also create a text summary and embed that:

Invoice INV-2024-018 from Vendor ABC, dated 2024-03-18, total ₹125,000, source: doc_123 page 4.

But the embedding should point back to the structured row, not replace it.

So I would use both:

Object Store where? Why
Typed record SQL / relational DB source of truth
Source span/cell provenance tables citation/highlight
Original text/table chunk document store audit/debug
Embedding vector index / pgvector / Qdrant recall
Text summary of typed record optional embedding input semantic lookup

The risky design is:

typed fact -> only embedding -> retrieve later

The safer design is:

typed fact -> validated SQL row -> evidence link -> optional embedding

3. Structured output helps, but does not prove correctness

Local models can be used for typed extraction. I would just frame it as an evaluation question rather than a yes/no model-size question.

A better question is:

Can this local model extract my target fields from my real files, with valid schema, correct values, and source evidence links?

There are several tools/patterns worth knowing:

  • LlamaIndex structured extraction uses typed/Pydantic-style extraction patterns.
  • LangChain structured output supports schema-based outputs.
  • Ollama structured outputs supports JSON Schema-style constrained outputs.
  • vLLM structured outputs supports JSON Schema / Pydantic-style structured output.
  • Outlines is useful if you want provider-independent structured generation.
  • Instructor is useful for Pydantic validation/retry style extraction.
  • llama.cpp grammars are useful if you are running local constrained decoding.

But the important caveat is:

Structured output can make the output parseable. It does not automatically make the extracted value correct.

For example, this can be valid JSON and still wrong:

{
  "metric_name": "Gross Profit",
  "period": "FY24",
  "value": 1900000,
  "currency": "INR",
  "source_cell": "D17"
}

The JSON may be valid. The Pydantic model may validate. But the value may still not match the sheet.

So I would use a pipeline like:

LLM extraction
  -> schema validation
  -> deterministic checks where possible
  -> evidence/source check
  -> retry or mark uncertain
  -> store structured record

For invoices, dates, amounts, currencies, fiscal years, and spreadsheet cells, many checks can be partly deterministic.

Examples:

Field Possible check
amount parse as decimal, compare against source span/cell
currency normalize symbol/code
date parse date and keep original string
fiscal year normalize FY23 / FY2023 / 2023
cell reference verify sheet/cell exists
invoice total compare against table total if available
source quote check quote appears in source span
source cell check extracted value matches cell/range

This is why I would store both the extracted value and its evidence reference.


4. CSV/XLSX: do not start with text chunks

For CSV/XLSX, I would not begin with normal text chunking.

Think in:

workbook
  -> sheet
  -> table / detected region
  -> row
  -> column
  -> cell
  -> range
  -> header context

CSV is usually closer to a table. XLSX can be a table, a report, a financial model, or a visual workbook.

So I would treat them differently.

Format First representation
CSV DataFrame / SQL table
Simple XLSX table DataFrame + sheet/cell provenance
Messy XLSX report sheet/region/table/cell model
Financial model workbook graph + formulas + values
Highlight/export target workbook/sheet/cell/range provenance

For table-like data, pandas read_csv and pandas read_excel are natural starting points.

For cell-level provenance, formulas, styles, and annotated XLSX export, openpyxl is useful.

A practical rule:

Use DataFrames for table-like regions, but keep workbook/sheet/cell provenance separately.

Do not flatten XLSX into plain text too early if you need citations or highlights.


5. Chunk overlap for spreadsheets

For normal prose documents, chunk overlap can help because sentences/paragraphs cross chunk boundaries.

For spreadsheets, overlap is usually not the main question.

A better question is:

What is the correct table/window/header context for this row or cell?

For a spreadsheet row, the useful context may be:

  • sheet name
  • table title
  • section title
  • row label
  • column label
  • nearby headers
  • units
  • currency
  • fiscal period
  • notes
  • formula
  • displayed value

A good spreadsheet “chunk” might be:

sheet + detected table + header rows + row group + relevant rows

or:

cell/range + row header context + column header context + surrounding table metadata

So instead of arbitrary text overlap, I would use structure-aware windows.

A simple approach:

Unit Good for
whole sheet summary sheet-level routing
detected table summary choosing relevant table
row-level record filtering/search
cell/range evidence citation/highlight
row window with headers LLM explanation
structured SQL row numeric query

There is also research around structure-aware chunking for tabular data, which is relevant because ordinary text chunking does not preserve table structure well.


6. Should CSV/XLSX become generalized typed models?

I would not use one universal typed model for everything.

I would use a few domain-specific typed models plus a generic source reference.

For example:

Document
DocumentVersion
SourceRef
ParsedTable
ParsedCell
Invoice
InvoiceLineItem
FinancialMetric
PeriodValue
Entity

SourceRef is the glue.

Example:

{
  "source_ref_id": "src_8821",
  "document_id": "doc_123",
  "document_version_id": "v3",
  "page": null,
  "sheet": "FY24",
  "cell_range": "D17:D18",
  "span_id": null,
  "bbox": null,
  "source_text_hash": "..."
}

Then an extracted fact can point to source_ref_id.

That gives you:

  • SQL queryability
  • auditability
  • citation
  • highlight/export
  • reprocessing safety

7. What to embed

I would embed different things for different purposes.

Embedding input Purpose
original text chunks semantic retrieval over prose
table summaries route to relevant table
row summaries semantic lookup over rows
typed record summaries semantic lookup over extracted facts
document summaries coarse routing
raw numeric-only cells usually weak for embeddings

For example, this is more useful to embed than just 1250000:

FY24 Gross Profit is ₹1,250,000, extracted from sheet FY24 cell D17 in document doc_123.

But the value should still live in SQL as a numeric value.

That way:

  • semantic search finds candidate records
  • SQL does exact filtering/comparison
  • source refs provide citation/highlight

8. Example flow: invoice threshold query

For:

Show invoices > ₹1 lakh

I would not rely on embeddings.

I would do:

parse documents
  -> extract Invoice records
  -> validate amount/currency/date
  -> store in SQL
  -> SQL query total_amount > 100000
  -> fetch source refs
  -> render answer with citations/highlights

Embeddings may help if the user asks a fuzzy question like:

Find large vendor bills related to consulting.

But once the candidate domain is identified, the threshold should be SQL.


9. Example flow: FY23 vs FY24 gross profit

For:

Compare FY23 vs FY24 gross profit

I would aim for:

parse spreadsheet/PDF table
  -> identify relevant table/metric
  -> extract FinancialMetric records
  -> normalize fiscal periods
  -> store values in SQL
  -> compare numerically
  -> cite source cells/spans

Possible typed model:

from decimal import Decimal
from pydantic import BaseModel

class FinancialMetric(BaseModel):
    metric_name: str
    period: str
    value: Decimal
    currency: str | None = None
    unit: str | None = None
    source_ref_id: str

Again, embeddings can help find “gross profit” semantically, but the comparison itself should be structured.


10. Suggested minimal implementation

If you want a simple version first:

1. Parse CSV/XLSX into DataFrames where possible.
2. Preserve sheet/cell/range provenance.
3. Define 2-3 typed models only.
4. Use local structured output to extract records.
5. Validate records.
6. Verify source refs and deterministic values where possible.
7. Store records in Postgres.
8. Store source refs separately.
9. Embed source chunks and optional record summaries.
10. Use SQL for structured questions.
11. Use embeddings for fuzzy recall.

Start small.

Maybe begin with:

Invoice
FinancialMetric
SourceRef

Then expand only when needed.


A slightly more concrete data model (click for more details) Possible extraction loop (click for more details) Tiny eval set I would build early (click for more details)

My practical answer

So for your specific question:

Should the LLM extract typed models from chunk text and store them in embeddings?

I would answer:

Let the LLM extract typed models from parsed text/table/cell context, but store those typed models as validated structured records linked to evidence. Use embeddings only as a retrieval index over chunks, table summaries, row summaries, or record summaries.

And for CSV/XLSX:

Do not treat them as prose documents first. Treat them as tables/sheets/cells/ranges first, and only create text representations after preserving that structure.

That separation should keep the system much easier to debug later.

Discussion in the ATmosphere

Loading comments...