
Invoice Data Recognition

Hugging Face Forums [Unofficial] April 1, 2026

Hmm… While commercial OCR services may include such features, standalone OCR models are often not very good at properly interpreting multi-page data. This is because, in most cases, the models are primarily trained on pairs of a single page and the information to be extracted…

The most straightforward workaround is to split the document into individual pages before feeding them to the OCR model.

Your old approach broke for a structural reason, not just because the OCR was free.

You were effectively doing:

500+ invoices in one PDF → OCR everything → flatten to one text stream → run a regex like invoice number \d{4,6}

That is brittle because PDF/OCR extraction often does not preserve normal reading order. PyMuPDF’s docs say plain PDF text extraction may come out “not in usual reading order,” with unexpected line breaks, and recommend using blocks or words with position data instead. (PyMuPDF)

So the main fix is not “use a better regex.” The main fix is:

split first, extract locally, keep coordinates, then validate the totals. Google’s Custom Splitter exists specifically to split packed PDFs into logical documents before extraction, and Google notes that bad splits are especially damaging because one split error causes downstream extraction errors. (Google Cloud Documentation)

What to do with shipping, freight, and discounts

Treat them as separate normalized fields in your schema. Do not fold them into one generic “total adjustment.”

A practical invoice schema for your case is:

{
  "invoice_id": "123456",
  "vendor_name": "ACME Supplies Ltd",
  "invoice_date": "2026-03-15",
  "currency": "USD",

  "subtotal": 1000.00,
  "line_item_discount_total": 20.00,
  "invoice_level_discount": 10.00,
  "shipping_charge": 15.00,
  "freight_charge": 40.00,
  "handling_charge": 0.00,
  "tax_total": 102.50,
  "invoice_total": 1127.50,
  "amount_due": 1127.50
}

And also store the raw label text that produced each field:

raw_label = "Shipping"
raw_label = "Shipping & Handling"
raw_label = "Freight"
raw_label = "Discount"

That matters because standard parsers do not always match your accounting distinctions exactly. AWS Textract explicitly standardizes DISCOUNT and SHIPPING_HANDLING_CHARGE. Azure’s invoice model extracts invoice fields and line items into structured JSON. Microsoft Dynamics’ invoice entity explicitly models FreightAmount, TotalDiscountAmount, TotalLineItemAmount, TotalAmountLessFreight, and TotalTax, which is close to the accounting structure you need. (AWS Document)

The formula to validate

Use arithmetic validation as a hard gate.

A practical rule is:

invoice_total
≈ subtotal
- line_item_discount_total
- invoice_level_discount
+ shipping_charge
+ freight_charge
+ handling_charge
+ tax_total
+ other_surcharges

And if the invoice has prior balance or prior credits:

amount_due
≈ invoice_total
+ previous_unpaid_balance
- credits_or_payments

This is not cosmetic cleanup. It is your error detector. If the parser confuses freight with a line item, or misses a discount, this check will usually fail.
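As a minimal sketch of that gate, assuming the field names from the schema above and a one-cent tolerance (the tolerance and the helper names `expected_total` and `totals_balance` are my own choices, not part of any parser's API):

```python
from decimal import Decimal

# Sign of each field in the reconciliation formula.
# Field names follow the schema above; missing fields count as zero.
SIGNED_FIELDS = {
    "subtotal": 1,
    "line_item_discount_total": -1,
    "invoice_level_discount": -1,
    "shipping_charge": 1,
    "freight_charge": 1,
    "handling_charge": 1,
    "tax_total": 1,
    "other_surcharges": 1,
}


def to_decimal(value) -> Decimal:
    # Go through str() so floats such as 102.5 do not pick up binary noise.
    return Decimal(str(value))


def expected_total(invoice: dict) -> Decimal:
    """Recompute the invoice total from its component fields."""
    return sum(
        (sign * to_decimal(invoice.get(name, 0)) for name, sign in SIGNED_FIELDS.items()),
        Decimal("0"),
    )


def totals_balance(invoice: dict, tolerance: Decimal = Decimal("0.01")) -> bool:
    """True when the stated invoice_total matches the recomputed total."""
    diff = expected_total(invoice) - to_decimal(invoice["invoice_total"])
    return abs(diff) <= tolerance
```

With the sample schema above, the components add up to 1127.50, so the check passes. If the freight line were dropped or read as a line item, the recomputed total would be off by 40.00 and the invoice would be rejected.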

Why your invoice number was missed

Your regex expected the text to appear like this:

Invoice Number 123456

But OCR/PDF extraction often returns something more like:

Invoice
Date
123456
Number

or mixes it with neighboring text from another block. PyMuPDF’s docs describe exactly this kind of issue and recommend using block and word extraction with coordinates to rebuild reading order or search local rectangles instead of relying on one global text stream. (PyMuPDF)

So instead of searching the full document with:

invoice number \d{4,6}

do this:

find the header region
find labels such as Invoice No, Invoice Number, Invoice #
collect candidate values near those labels
rank them by distance and alignment
then validate the winner with ^\d{4,6}$

That changes regex from a discovery method into a validator. That is much more reliable.
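Here is a rough sketch of that label-then-validate idea. It assumes words arrive as boxes with coordinates; the `Word` type, the pixel tolerances, and the 50-point misalignment penalty are all illustrative assumptions you would tune on real pages.

```python
import math
import re
from typing import NamedTuple, Optional


class Word(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


INVOICE_ID = re.compile(r"^\d{4,6}$")       # the regex is now only a validator
LABEL_TAILS = {"number", "no", "no.", "#"}  # second word of "Invoice Number" etc.


def center(w: Word) -> tuple:
    return ((w.x0 + w.x1) / 2, (w.y0 + w.y1) / 2)


def same_line(a: Word, b: Word, tol: float = 3.0) -> bool:
    return abs(center(a)[1] - center(b)[1]) <= tol


def find_label_anchors(words: list) -> list:
    """Return the tail word of each 'Invoice Number'-style label."""
    anchors = []
    for tail in words:
        if tail.text.lower().rstrip(":") not in LABEL_TAILS:
            continue
        for head in words:
            if (head.text.lower() == "invoice" and same_line(head, tail)
                    and 0 <= tail.x0 - head.x1 <= 15):
                anchors.append(tail)
    return anchors


def score(anchor: Word, cand: Word) -> float:
    """Lower is better: distance, plus a penalty when not aligned with the label."""
    (ax, ay), (cx, cy) = center(anchor), center(cand)
    aligned = same_line(anchor, cand) or abs(cand.x0 - anchor.x0) <= 5
    return math.hypot(cx - ax, cy - ay) + (0 if aligned else 50)


def find_invoice_id(words: list) -> Optional[str]:
    anchors = find_label_anchors(words)
    candidates = [w for w in words if INVOICE_ID.match(w.text)]
    scored = [(score(a, c), c.text) for a in anchors for c in candidates]
    return min(scored)[1] if scored else None
```

The point of the design is that a stray 5-digit PO or ZIP code elsewhere on the page still matches the pattern, but it loses the ranking because it is far from any invoice-number label.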

The concrete pipeline I would use

1. Split the packed PDF into individual invoices

This is the first change.

Start with page-level signals:

the word "Invoice" near the top
an invoice-number/date block near the header
totals block near the bottom
repeated vendor header/logo
continuation pages with line-item tables but no new invoice header

Google’s Custom Splitter is built around exactly this use case: composite files containing multiple logical documents that then get routed to the appropriate extractor. (Google Cloud Documentation)
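If you want a homegrown baseline before trying a managed splitter, the page-level signals can be sketched like this. It assumes you already have the header-region text of each page (rebuilt from coordinates, not from the flattened stream); the pattern and the function name `split_into_invoices` are hypothetical.

```python
import re

# Assumed header pattern: an invoice-number label followed by a 4-6 digit value.
HEADER = re.compile(r"invoice\s*(?:number|no\.?|#)\s*:?\s*(\d{4,6})", re.IGNORECASE)


def split_into_invoices(page_headers: list) -> list:
    """Group page indexes into invoices.

    A page opens a new invoice when its header shows an invoice number that
    differs from the current group's number. Pages with no header, or with
    the same number repeated, are treated as continuation pages.
    """
    groups = []  # each entry: {"invoice_id": str or None, "pages": [int, ...]}
    for page_index, header in enumerate(page_headers):
        match = HEADER.search(header)
        invoice_id = match.group(1) if match else None
        is_new = invoice_id is not None and (
            not groups or groups[-1]["invoice_id"] != invoice_id
        )
        if is_new or not groups:
            groups.append({"invoice_id": invoice_id, "pages": [page_index]})
        else:
            groups[-1]["pages"].append(page_index)
    return groups
```

This only uses one of the signals listed above. In practice I would also check for a totals block at the bottom of the previous page before trusting a boundary, since a wrong split poisons everything downstream.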

2. Use native PDF text before OCR when possible

If a page is born-digital, extract words and blocks directly from the PDF first. PyMuPDF recommends block and word extraction because plain text order may be wrong, and Page.get_text("blocks") / Page.get_text("words") preserve useful position information. (PyMuPDF)
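A small sketch of rebuilding reading order from word positions. `Page.get_text("words")` returns tuples of the form `(x0, y0, x1, y1, word, block_no, line_no, word_no)`; the grouping function and its 3-point tolerance are my own assumptions, and `load_words` needs PyMuPDF installed.

```python
def load_words(pdf_path: str, page_number: int = 0) -> list:
    """Read word tuples from one page of a born-digital PDF (requires PyMuPDF)."""
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        return doc[page_number].get_text("words")


def words_to_lines(words: list, y_tolerance: float = 3.0) -> list:
    """Group word tuples into visual lines, top to bottom, left to right."""
    lines = []  # each entry: [y_center_of_first_word, [word tuples]]
    for w in sorted(words, key=lambda w: ((w[1] + w[3]) / 2, w[0])):
        y_center = (w[1] + w[3]) / 2
        if lines and abs(lines[-1][0] - y_center) <= y_tolerance:
            lines[-1][1].append(w)
        else:
            lines.append([y_center, [w]])
    return [
        " ".join(t[4] for t in sorted(line_words, key=lambda t: t[0]))
        for _, line_words in lines
    ]
```

Fed the scrambled "Invoice / Date / 123456 / Number" sequence from earlier, with coordinates attached, this puts "Invoice Number 123456" back on one line because those three words share a baseline.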

3. Use document OCR only for scanned pages

For scanned pages or images, use invoice/document AI OCR rather than generic OCR-only tooling. Azure’s invoice model is built to handle phone captures, scanned documents, and digital PDFs, and returns recognized text, tables, and invoice-specific fields plus line items. AWS Textract’s invoice/receipt path similarly outputs structured summary fields and line items instead of one text blob. (Microsoft Learn)

4. Keep coordinates in your intermediate data

For each word, keep:

page number
text
bounding box
line ID
block ID
confidence
source type: native PDF or OCR

This is what lets you ask useful questions like “what is near the invoice-number label?” instead of “does the whole OCR blob contain the pattern?”
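One possible shape for that intermediate record, as a sketch. The class name `WordRecord` and the converter are hypothetical; the tuple layout in `from_pymupdf_word` is the one PyMuPDF uses for `get_text("words")`, and native PDF text carries no confidence, so that field is left as `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WordRecord:
    page: int
    text: str
    bbox: tuple                  # (x0, y0, x1, y1) in page coordinates
    line_id: int
    block_id: int
    confidence: Optional[float]  # None for native PDF text
    source: str                  # "native" or "ocr"


def from_pymupdf_word(page: int, w: tuple) -> WordRecord:
    """Convert one PyMuPDF word tuple into the shared record type."""
    x0, y0, x1, y1, text, block_no, line_no, _word_no = w[:8]
    return WordRecord(
        page=page,
        text=text,
        bbox=(x0, y0, x1, y1),
        line_id=line_no,
        block_id=block_no,
        confidence=None,
        source="native",
    )
```

An OCR backend would get its own converter that fills in `confidence` and sets `source="ocr"`, so everything downstream can treat both sources the same way.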

5. Zone the page before extracting fields

Split each invoice into approximate regions:

header
vendor/bill-to area
line-item area
totals area
footer/remittance area

Then only search:

invoice number and date in the header
shipping/freight/discount/tax/total in the totals area
products, qty, price, amount in the line-item area

This mirrors how invoice parsers expose output: Azure returns text, tables, and invoice-specific fields; AWS separates summary fields and line items. (Microsoft Learn)
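A crude way to start zoning is by vertical position alone. The sketch below assumes words are `(x0, y0, x1, y1, text)` tuples with y growing downward; the fractional cut-offs and the zone names are guesses that you would calibrate per vendor layout.

```python
# Upper bound of each zone as a fraction of page height (assumed values).
ZONE_BOUNDS = [
    (0.20, "header"),
    (0.35, "parties"),     # vendor / bill-to area
    (0.70, "line_items"),
    (0.90, "totals"),
]

# Which zone each field is allowed to be searched in.
FIELD_ZONE = {
    "invoice_number": "header",
    "invoice_date": "header",
    "shipping_charge": "totals",
    "freight_charge": "totals",
    "invoice_level_discount": "totals",
    "tax_total": "totals",
    "invoice_total": "totals",
}


def assign_zone(y_center: float, page_height: float) -> str:
    frac = y_center / page_height
    for upper, name in ZONE_BOUNDS:
        if frac < upper:
            return name
    return "footer"


def words_for_field(field: str, words: list, page_height: float) -> list:
    """Keep only the words that sit in the zone where this field may appear."""
    zone = FIELD_ZONE[field]
    return [
        w for w in words
        if assign_zone((w[1] + w[3]) / 2, page_height) == zone
    ]
```

Fixed fractions break on multi-page invoices where the totals block only appears on the last page, so a better version would locate the totals block by its labels first and derive the zones from that.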

6. Treat charges as labeled totals lines

Inside the totals block, extract a list of labeled amount lines:

| Raw label | Internal field |
| --- | --- |
| Shipping | `shipping_charge` |
| Shipping & Handling | `shipping_charge` or split later |
| Freight | `freight_charge` |
| Discount | `invoice_level_discount` |
| Rebate | `invoice_level_discount` |

Because your accounting system distinguishes freight from shipping, do not collapse them automatically.
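A sketch of that mapping step, keeping the raw label next to each normalized field. The mapping mirrors the label list above; the function names and the choice to store amounts as positive magnitudes (so a printed "-10.00" discount becomes 10.00) are my assumptions.

```python
import re
from decimal import Decimal

LABEL_TO_FIELD = {
    "shipping": "shipping_charge",
    "shipping & handling": "shipping_charge",  # may need a manual split later
    "freight": "freight_charge",
    "discount": "invoice_level_discount",
    "rebate": "invoice_level_discount",
}


def normalize_label(raw: str) -> str:
    """Lowercase, drop a trailing colon, collapse repeated whitespace."""
    return re.sub(r"\s+", " ", raw.strip().rstrip(":").lower())


def map_totals_lines(lines: list) -> tuple:
    """Map (raw_label, amount_string) pairs to schema fields.

    Returns (fields, unmapped). Unknown labels are never guessed at; they are
    returned separately so a person can review them.
    """
    fields, unmapped = {}, []
    for raw_label, amount in lines:
        field = LABEL_TO_FIELD.get(normalize_label(raw_label))
        if field is None:
            unmapped.append((raw_label, amount))
            continue
        entry = fields.setdefault(field, {"amount": Decimal("0"), "raw_labels": []})
        entry["amount"] += abs(Decimal(amount))
        entry["raw_labels"].append(raw_label)
    return fields, unmapped
```

Freight and shipping land in different fields here, and anything the table does not know (a fuel surcharge, say) goes to `unmapped` instead of being silently folded into another charge.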

7. Reconstruct line items separately

Do not use header-field logic for line items.

For line items, use a table or pseudo-table approach:

detect numeric columns on the right
group words into rows by vertical overlap
treat left text as description
merge multiline descriptions when there is no new numeric anchor

That is where invoice extraction usually becomes hard.
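To make the pseudo-table idea concrete, here is a sketch that groups words into rows by vertical overlap and merges continuation lines. The `Word` type, the `numeric_x` column boundary, and the number pattern are illustrative assumptions; real invoices need the column boundary detected per page.

```python
import re
from typing import NamedTuple


class Word(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


NUMBER = re.compile(r"^-?\d[\d,]*(\.\d+)?$")


def overlaps_vertically(a: Word, b: Word) -> bool:
    return min(a.y1, b.y1) - max(a.y0, b.y0) > 0


def group_rows(words: list) -> list:
    """Group words into rows; a word joins a row if it overlaps the row's first word."""
    rows = []
    for w in sorted(words, key=lambda w: (w.y0, w.x0)):
        if rows and overlaps_vertically(rows[-1][0], w):
            rows[-1].append(w)
        else:
            rows.append([w])
    return [sorted(row, key=lambda w: w.x0) for row in rows]


def build_line_items(words: list, numeric_x: float = 300.0) -> list:
    """Left text becomes the description; right-hand numbers anchor a new item."""
    items = []
    for row in group_rows(words):
        description = " ".join(w.text for w in row if w.x0 < numeric_x)
        numbers = [w.text for w in row if w.x0 >= numeric_x and NUMBER.match(w.text)]
        if numbers:
            items.append({"description": description, "numbers": numbers})
        elif items:
            # No numeric anchor: treat the row as a wrapped description line.
            items[-1]["description"] += " " + description
    return items
```

Rows that appear before the first numeric anchor, such as the column header row, are dropped by this sketch, which is usually what you want.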

Best practical options

Fastest path

Benchmark a purpose-built invoice parser first.

Good starting options are:

Google Document AI: Custom Splitter + Invoice Parser + uptraining/custom fields. Google explicitly says you can uptrain the Invoice Parser with your own data and add custom fields that are not supported by the pretrained model. That is directly useful for a field like freight_charge. (Google Cloud Documentation)
AWS Textract AnalyzeExpense: it already standardizes DISCOUNT and SHIPPING_HANDLING_CHARGE, plus summary fields and line items. (AWS Document)
Azure Document Intelligence invoice model: it handles scanned images, PDFs, and line items in structured JSON. (Microsoft Learn)

Strong custom path

If you want to own the stack:

split packed PDFs first
use PyMuPDF blocks/words for digital PDFs
use OCR only for scanned pages
keep coordinates
zone header/totals/items separately
extract local candidates near labels
normalize freight, shipping, discount, tax
reconcile the math before posting anything

My concrete advice for you

For your situation, I would do this in order:

Phase 1

Take 30 to 50 invoices from the packed PDF and manually create a small gold set:

correct invoice boundaries
correct invoice number
subtotal
discount
shipping
freight
tax
total
amount due

Phase 2

Test two paths:

managed invoice parser with splitting
native PDF extraction + local rules on already-split invoices

That will tell you very quickly whether your real bottleneck is:

split detection
OCR quality
reading order
totals parsing
line-item grouping
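To compare the two paths against the gold set, a per-field accuracy table is usually enough. This is a minimal sketch; the dictionary layout (one entry per gold invoice, keyed by whatever document key you choose) and the function name are assumptions.

```python
from collections import Counter


def field_accuracy(gold: dict, predicted: dict, fields: list) -> dict:
    """Fraction of gold invoices where each field was extracted exactly right.

    gold and predicted map a document key to a dict of field values. An
    invoice that is missing from predicted (for example because of a bad
    split) counts as wrong on every field.
    """
    if not gold:
        return {field: 0.0 for field in fields}
    correct = Counter()
    for key, gold_fields in gold.items():
        pred_fields = predicted.get(key, {})
        for field in fields:
            if field in pred_fields and pred_fields[field] == gold_fields.get(field):
                correct[field] += 1
    return {field: correct[field] / len(gold) for field in fields}
```

If invoice numbers score well but freight and discount score badly, the bottleneck is totals parsing. If whole invoices are missing, it is split detection.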

Phase 3

Lock your schema before tuning models:

subtotal
line_item_discount_total
invoice_level_discount
shipping_charge
freight_charge
tax_total
invoice_total
amount_due

Phase 4

Add reconciliation rules and reject anything that does not balance.

Bottom line

Your earlier failure does not mean invoice extraction is a bad fit.

It means the earlier workflow was fragile:

too many invoices in one PDF
flattened OCR text
regex dependent on OCR order

A stronger workflow is:

packed PDF → split into invoices → extract with coordinates → parse header/totals/lines separately → keep freight separate from shipping → keep discounts explicit → validate the math

That is the path I would take. (Google Cloud Documentation)