Invoice Data Recognition
Hmm… While commercial OCR services may include such features, standalone OCR models are often not very good at properly interpreting multi-page data. This is because, in most cases, the models are primarily trained on pairs of a single page and the information to be extracted…
The most straightforward workaround is to split the document into individual pages before feeding them to the OCR model:
Your old approach broke for a structural reason, not just because the OCR was free.
You were effectively doing:
500+ invoices in one PDF → OCR everything → flatten to one text stream → run regex like invoice number \d{4,6}
That is brittle because PDF/OCR extraction often does not preserve normal reading order. PyMuPDF’s docs say plain PDF text extraction may come out “not in usual reading order,” with unexpected line breaks, and recommend using blocks or words with position data instead. (PyMuPDF)
So the main fix is not “use a better regex.” The main fix is:
split first, extract locally, keep coordinates, then validate the totals. Google’s Custom Splitter exists specifically to split packed PDFs into logical documents before extraction, and Google notes that bad splits are especially damaging because one split error causes downstream extraction errors. (Google Cloud Documentation)
What to do with shipping, freight, and discounts
Treat them as separate normalized fields in your schema. Do not fold them into one generic “total adjustment.”
A practical invoice schema for your case is:
{
"invoice_id": "123456",
"vendor_name": "ACME Supplies Ltd",
"invoice_date": "2026-03-15",
"currency": "USD",
"subtotal": 1000.00,
"line_item_discount_total": 20.00,
"invoice_level_discount": 10.00,
"shipping_charge": 15.00,
"freight_charge": 40.00,
"handling_charge": 0.00,
"tax_total": 102.50,
"invoice_total": 1127.50,
"amount_due": 1127.50
}
And also store the raw label text that produced each field:
raw_label = "Shipping"
raw_label = "Shipping & Handling"
raw_label = "Freight"
raw_label = "Discount"
That matters because standard parsers do not always match your accounting distinctions exactly. AWS Textract explicitly standardizes DISCOUNT and SHIPPING_HANDLING_CHARGE. Azure’s invoice model extracts invoice fields and line items into structured JSON. Microsoft Dynamics’ invoice entity explicitly models FreightAmount, TotalDiscountAmount, TotalLineItemAmount, TotalAmountLessFreight, and TotalTax, which is close to the accounting structure you need. (AWS Document)
The formula to validate
Use arithmetic validation as a hard gate.
A practical rule is:
invoice_total
≈ subtotal
- line_item_discount_total
- invoice_level_discount
+ shipping_charge
+ freight_charge
+ handling_charge
+ tax_total
+ other_surcharges
And if the invoice has prior balance or prior credits:
amount_due
≈ invoice_total
+ previous_unpaid_balance
- credits_or_payments
This is not cosmetic cleanup. It is your error detector. If the parser confuses freight with a line item, or misses a discount, this check will usually fail.
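A minimal sketch of that gate, assuming each invoice is a plain dict using the field names from the schema above. The one-cent `TOLERANCE` and the function names are my own placeholders, not part of any parser's API; fields the invoice does not have are treated as zero.

```python
from decimal import Decimal

# Hypothetical tolerance: one cent, to absorb rounding printed on the invoice.
TOLERANCE = Decimal("0.01")


def _d(invoice: dict, key: str) -> Decimal:
    """Read a money field as Decimal; a missing field counts as zero."""
    return Decimal(str(invoice.get(key, 0)))


def expected_total(invoice: dict) -> Decimal:
    """Recompute invoice_total from its components."""
    return (
        _d(invoice, "subtotal")
        - _d(invoice, "line_item_discount_total")
        - _d(invoice, "invoice_level_discount")
        + _d(invoice, "shipping_charge")
        + _d(invoice, "freight_charge")
        + _d(invoice, "handling_charge")
        + _d(invoice, "tax_total")
        + _d(invoice, "other_surcharges")
    )


def reconcile(invoice: dict) -> list:
    """Return a list of problems; an empty list means the invoice balances."""
    problems = []
    total = _d(invoice, "invoice_total")
    computed = expected_total(invoice)
    if abs(total - computed) > TOLERANCE:
        problems.append(f"invoice_total {total} != computed {computed}")
    # Only check amount_due when the parser actually extracted it.
    if "amount_due" in invoice:
        due = (
            total
            + _d(invoice, "previous_unpaid_balance")
            - _d(invoice, "credits_or_payments")
        )
        if abs(_d(invoice, "amount_due") - due) > TOLERANCE:
            problems.append(f"amount_due {invoice['amount_due']} != computed {due}")
    return problems
```

Run against the sample invoice above, `reconcile` returns an empty list. If the parser drops the 10.00 invoice-level discount, the computed total becomes 1137.5 and the invoice is flagged.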
Why your invoice number was missed
Your regex expected the text to appear like this:
Invoice Number 123456
But OCR/PDF extraction often returns something more like:
Invoice
Date
123456
Number
or mixes it with neighboring text from another block. PyMuPDF’s docs describe exactly this kind of issue and recommend using block and word extraction with coordinates to rebuild reading order or search local rectangles instead of relying on one global text stream. (PyMuPDF)
So instead of searching the full document with:
invoice number \d{4,6}
do this:
- find the header region
- find labels such as Invoice No, Invoice Number, Invoice #
- collect candidate values near those labels
- rank them by distance and alignment
- then validate the winner with
^\d{4,6}$
That changes regex from a discovery method into a validator. That is much more reliable.
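Here is a rough sketch of that idea. It assumes each word is a tuple `(x0, y0, x1, y1, text)`, which mirrors the first five elements of what PyMuPDF's `get_text("words")` returns. The helper names, the label list, and the scoring rule are illustrative assumptions, so tune them on your own invoices.

```python
import re

VALUE_RE = re.compile(r"^\d{4,6}$")          # validator, not a discovery pattern
LABEL_TAILS = {"number", "no", "no.", "#"}   # hypothetical label variants


def find_label_boxes(words):
    """Return boxes covering 'Invoice' followed closely by Number/No/# on the same row."""
    boxes = []
    for x0, y0, x1, y1, text in words:
        if text.lower() != "invoice":
            continue
        height = y1 - y0
        for tx0, ty0, tx1, ty1, tail in words:
            same_row = abs(ty0 - y0) < height / 2
            close_right = 0 <= tx0 - x1 < 3 * height
            if tail.lower().rstrip(":") in LABEL_TAILS and same_row and close_right:
                boxes.append((x0, y0, tx1, max(y1, ty1)))
    return boxes


def score(label, cand):
    """Lower is better. Accept values right of the label or below it."""
    lx0, ly0, lx1, ly1 = label
    cx0, cy0, cx1, cy1 = cand[:4]
    if cy0 < ly1 and cy1 > ly0:              # vertical overlap: same row
        return cx0 - lx1 if cx0 >= lx1 else float("inf")
    if cy0 >= ly1:                           # below the label
        return (cy0 - ly1) + abs(cx0 - lx0)
    return float("inf")                      # above the label: ignore


def find_invoice_number(words):
    """Pick the validated candidate closest to any invoice-number label."""
    candidates = [w for w in words if VALUE_RE.match(w[4])]
    best = None
    for label in find_label_boxes(words):
        for cand in candidates:
            s = score(label, cand)
            if s != float("inf") and (best is None or s < best[0]):
                best = (s, cand[4])
    return best[1] if best else None
```

Because everything is driven by coordinates, the order in which the words arrive no longer matters, which is exactly the failure mode of the flattened text stream.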
The concrete pipeline I would use
1. Split the packed PDF into individual invoices
This is the first change.
Start with page-level signals:
- Invoice near the top
- an invoice-number/date block near the header
- totals block near the bottom
- repeated vendor header/logo
- continuation pages with line-item tables but no new invoice header
Google’s Custom Splitter is built around exactly this use case: composite files containing multiple logical documents that then get routed to the appropriate extractor. (Google Cloud Documentation)
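If you do the split yourself, the grouping logic can be sketched like this. It assumes you have already produced, for each page, the invoice ID detected in its header, or `None` for a continuation page with no new header. The function name `group_pages` and that input shape are my assumptions.

```python
from typing import List, Optional, Tuple


def group_pages(page_ids: List[Optional[str]]) -> List[Tuple[Optional[str], List[int]]]:
    """Group consecutive pages into invoices.

    page_ids[i] is the invoice ID found in the header of page i,
    or None when the page has no invoice header (a continuation page).
    """
    groups: List[Tuple[Optional[str], List[int]]] = []
    for page_no, inv_id in enumerate(page_ids):
        starts_new = inv_id is not None and (not groups or groups[-1][0] != inv_id)
        if starts_new or not groups:
            # A leading page without a header becomes a (None, [...]) group
            # so it can be sent to manual review instead of being lost.
            groups.append((inv_id, [page_no]))
        else:
            groups[-1][1].append(page_no)
    return groups


pages = ["1001", None, "1002", "1002", None, "1003"]
for inv_id, page_numbers in group_pages(pages):
    print(inv_id, page_numbers)
```

A page that repeats the same ID as the previous page is treated as part of the same invoice, since some vendors print the header on every page.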
2. Use native PDF text before OCR when possible
If a page is born-digital, extract words and blocks directly from the PDF first. PyMuPDF recommends block and word extraction because plain text order may be wrong, and Page.get_text("blocks") / Page.get_text("words") preserve useful position information. (PyMuPDF)
3. Use document OCR only for scanned pages
For scanned pages or images, use invoice/document AI OCR rather than generic OCR-only tooling. Azure’s invoice model is built to handle phone captures, scanned documents, and digital PDFs, and returns recognized text, tables, and invoice-specific fields plus line items. AWS Textract’s invoice/receipt path similarly outputs structured summary fields and line items instead of one text blob. (Microsoft Learn)
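Steps 2 and 3 together amount to a per-page routing decision. A small sketch, assuming PyMuPDF is installed (imported here as `fitz`). The 20-word threshold and the helper names are guesses on my part; check them against a few of your own scanned and digital pages.

```python
def needs_ocr(words, min_words=20):
    """Treat a page as scanned when its text layer has too few real words.

    `words` is a list of tuples whose fifth element is the word text,
    as returned by PyMuPDF's page.get_text("words").
    """
    real = [w for w in words if any(ch.isalnum() for ch in w[4])]
    return len(real) < min_words


def page_words(pdf_path):
    """Yield (page_number, words, needs_ocr_flag) for every page in the PDF."""
    import fitz  # PyMuPDF, imported lazily so needs_ocr() has no dependency

    with fitz.open(pdf_path) as doc:
        for page in doc:
            words = page.get_text("words")  # tuples with coordinates
            yield page.number, words, needs_ocr(words)
```

Pages flagged by `needs_ocr` go to the OCR or document-AI path; the rest keep their native words and coordinates.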
4. Keep coordinates in your intermediate data
For each word, keep:
- page number
- text
- bounding box
- line ID
- block ID
- confidence
- source type: native PDF or OCR
This is what lets you ask useful questions like “what is near the invoice-number label?” instead of “does the whole OCR blob contain the pattern?”
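One possible shape for that intermediate record, as a sketch. The `Word` class, `from_pymupdf`, and `words_near` are names I made up; the only external fact relied on is the tuple layout `(x0, y0, x1, y1, text, block_no, line_no, word_no)` that PyMuPDF returns for `get_text("words")`.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Word:
    page: int
    text: str
    bbox: Tuple[float, float, float, float]  # x0, y0, x1, y1
    line_id: int
    block_id: int
    confidence: Optional[float]  # None for native PDF text
    source: str                  # "native" or "ocr"


def from_pymupdf(page_no, w):
    """Convert one PyMuPDF 'words' tuple into a Word record."""
    x0, y0, x1, y1, text, block_no, line_no, _word_no = w
    return Word(page_no, text, (x0, y0, x1, y1), line_no, block_no, None, "native")


def center(word):
    x0, y0, x1, y1 = word.bbox
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def words_near(words, anchor, max_dist):
    """Words on the anchor's page whose centre is within max_dist of the anchor's centre."""
    ax, ay = center(anchor)
    nearby = []
    for w in words:
        if w is anchor or w.page != anchor.page:
            continue
        wx, wy = center(w)
        if math.hypot(wx - ax, wy - ay) <= max_dist:
            nearby.append(w)
    return nearby
```

OCR output can be mapped into the same record with its own confidence value, so the downstream rules do not need to know where a word came from.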
5. Zone the page before extracting fields
Split each invoice into approximate regions:
- header
- vendor/bill-to area
- line-item area
- totals area
- footer/remittance area
Then only search:
- invoice number and date in the header
- shipping/freight/discount/tax/total in the totals area
- products, qty, price, amount in the line-item area
This mirrors how invoice parsers expose output: Azure returns text, tables, and invoice-specific fields; AWS separates summary fields and line items. (Microsoft Learn)
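As a very rough starting point, zoning can be done from the vertical position alone. The cut-off fractions below are placeholders I picked for illustration, not measured values; real layouts vary by vendor, and the totals block on a multi-page invoice usually sits only on the last page.

```python
# Hypothetical zone boundaries as fractions of page height (top = 0.0).
ZONES = [
    (0.20, "header"),
    (0.35, "parties"),      # vendor / bill-to area
    (0.75, "line_items"),
    (0.92, "totals"),
    (1.01, "footer"),       # footer / remittance area
]


def zone_for(y_center, page_height):
    """Map a vertical position to a zone name."""
    rel = y_center / page_height
    for upper, name in ZONES:
        if rel < upper:
            return name
    return "footer"


def words_in_zone(words, page_height, zone):
    """Filter (x0, y0, x1, y1, text) tuples to those whose centre falls in the zone."""
    return [
        w for w in words
        if zone_for((w[1] + w[3]) / 2, page_height) == zone
    ]
```

Searching for the invoice number only in `words_in_zone(words, height, "header")` already removes most false matches such as ZIP codes or PO numbers further down the page.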
6. Treat charges as labeled totals lines
Inside the totals block, extract a list of labeled amount lines:
| Raw label | Internal field |
|---|---|
| Shipping | shipping_charge |
| Shipping & Handling | shipping_charge or split later |
| Freight | freight_charge |
| Discount | invoice_level_discount |
| Rebate | invoice_level_discount |
Because your accounting system distinguishes freight from shipping, do not collapse them automatically.
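The mapping table can be expressed directly in code. This sketch assumes the totals block has already been reduced to `(raw_label, amount_text)` pairs. `LABEL_MAP` simply copies the table above, and both function names are hypothetical. Unknown labels are kept with `field=None` so they can be reviewed rather than silently dropped.

```python
import re
from decimal import Decimal

LABEL_MAP = {
    "shipping": "shipping_charge",
    "shipping & handling": "shipping_charge",  # or split later
    "freight": "freight_charge",
    "discount": "invoice_level_discount",
    "rebate": "invoice_level_discount",
}


def normalize_label(raw):
    """Map a raw totals label to an internal field name, or None if unknown."""
    key = re.sub(r"\s+", " ", raw).strip().rstrip(":").strip().lower()
    key = key.replace(" and ", " & ")
    return LABEL_MAP.get(key)


def parse_totals_lines(lines):
    """Turn (raw_label, amount_text) pairs into normalized records."""
    records = []
    for raw_label, amount_text in lines:
        cleaned = amount_text.replace("$", "").replace(",", "").strip()
        records.append({
            "field": normalize_label(raw_label),
            "amount": Decimal(cleaned),
            "raw_label": raw_label,   # keep the evidence for auditing
        })
    return records
```

Freight and shipping land in different fields here, and the raw label travels with each amount so an accountant can see why a value was classified the way it was.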
7. Reconstruct line items separately
Do not use header-field logic for line items.
For line items, use a table or pseudo-table approach:
- detect numeric columns on the right
- group words into rows by vertical overlap
- treat left text as description
- merge multiline descriptions when there is no new numeric anchor
That is where invoice extraction usually becomes hard.
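A simplified sketch of the pseudo-table approach, working on `(x0, y0, x1, y1, text)` word tuples from the line-item area. It makes one big assumption: anything formatted like a money amount (two decimals) is a numeric anchor, and everything else is description. Quantities and column positions are ignored here, so treat it as a starting point only.

```python
import re

# Assumption: amounts look like 12.50 or 1,250.00.
NUM_RE = re.compile(r"^-?[\d,]+\.\d{2}$")


def group_rows(words):
    """Group words into rows: a word joins the current row if its vertical
    centre falls inside that row's vertical extent."""
    rows = []
    for w in sorted(words, key=lambda w: (w[1], w[0])):
        cy = (w[1] + w[3]) / 2
        if rows and rows[-1]["y0"] <= cy <= rows[-1]["y1"]:
            rows[-1]["words"].append(w)
            rows[-1]["y1"] = max(rows[-1]["y1"], w[3])
        else:
            rows.append({"y0": w[1], "y1": w[3], "words": [w]})
    for row in rows:
        row["words"].sort(key=lambda w: w[0])  # left-to-right within the row
    return rows


def build_items(words):
    """Build line items; rows without a numeric anchor extend the previous description."""
    items = []
    for row in group_rows(words):
        texts = [w[4] for w in row["words"]]
        amounts = [t for t in texts if NUM_RE.match(t)]
        description = " ".join(t for t in texts if not NUM_RE.match(t))
        if amounts:
            items.append({"description": description, "amounts": amounts})
        elif items:
            items[-1]["description"] += " " + description
    return items
```

The continuation rule is the important part: a row with text but no amount is merged into the item above it instead of becoming a phantom line item.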
Best practical options
Fastest path
Benchmark a purpose-built invoice parser first.
Good starting options are:
- Google Document AI: Custom Splitter + Invoice Parser + uptraining/custom fields. Google explicitly says you can uptrain the Invoice Parser with your own data and add custom fields that are not supported by the pretrained model. That is directly useful for a field like freight_charge. (Google Cloud Documentation)
- AWS Textract AnalyzeExpense: it already standardizes DISCOUNT and SHIPPING_HANDLING_CHARGE, plus summary fields and line items. (AWS Document)
- Azure Document Intelligence invoice model: it handles scanned images, PDFs, and line items in structured JSON. (Microsoft Learn)
Strong custom path
If you want to own the stack:
- split packed PDFs first
- use PyMuPDF blocks/words for digital PDFs
- use OCR only for scanned pages
- keep coordinates
- zone header/totals/items separately
- extract local candidates near labels
- normalize freight, shipping, discount, tax
- reconcile the math before posting anything
My concrete advice for you
For your situation, I would do this in order:
Phase 1
Take 30 to 50 invoices from the packed PDF and manually create a small gold set:
- correct invoice boundaries
- correct invoice number
- subtotal
- discount
- shipping
- freight
- tax
- total
- amount due
Phase 2
Test two paths:
- managed invoice parser with splitting
- native PDF extraction + local rules on already-split invoices
That will tell you very quickly whether your real bottleneck is:
- split detection
- OCR quality
- reading order
- totals parsing
- line-item grouping
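To compare the two paths against the gold set, a per-field accuracy table is usually enough. A sketch, assuming both the gold set and each pipeline's output are dicts keyed by some invoice key you choose (for example the gold invoice number), each mapping to a dict of fields. The function name and data shape are my assumptions.

```python
from collections import Counter


def field_accuracy(gold, predicted):
    """Return {field: fraction of gold invoices where the prediction matches exactly}.

    gold / predicted: {invoice_key: {field_name: value}}
    A missing invoice or missing field counts as a miss.
    """
    hits, seen = Counter(), Counter()
    for key, gold_fields in gold.items():
        pred_fields = predicted.get(key, {})
        for field, expected in gold_fields.items():
            seen[field] += 1
            if pred_fields.get(field) == expected:
                hits[field] += 1
    return {field: hits[field] / seen[field] for field in seen}


gold = {
    "A": {"invoice_id": "1001", "freight": "40.00", "total": "1127.50"},
    "B": {"invoice_id": "1002", "freight": "0.00", "total": "310.00"},
}
managed = {
    "A": {"invoice_id": "1001", "freight": "40.00", "total": "1127.50"},
    "B": {"invoice_id": "1002", "freight": "12.00", "total": "310.00"},
}
print(field_accuracy(gold, managed))
```

A low score on one field across both paths points at that field's rules; a low score across all fields on one path usually points at splitting or reading order.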
Phase 3
Lock your schema before tuning models:
- subtotal
- line_item_discount_total
- invoice_level_discount
- shipping_charge
- freight_charge
- tax_total
- invoice_total
- amount_due
Phase 4
Add reconciliation rules and reject anything that does not balance.
Bottom line
Your earlier failure does not mean invoice extraction is a bad fit.
It means the earlier workflow was fragile:
- too many invoices in one PDF
- flattened OCR text
- regex dependent on OCR order
A stronger workflow is:
packed PDF → split into invoices → extract with coordinates → parse header/totals/lines separately → keep freight separate from shipping → keep discounts explicit → validate the math
That is the path I would take. (Google Cloud Documentation)