Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreic42i46modvycbovqqvy36jr4nx6wnky3gozhr7hys45rvfzt26uq",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mifsdqojaf72"
  },
  "path": "/t/invoice-data-recognition/174564#post_4",
  "publishedAt": "2026-04-01T03:08:12.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "PyMuPDF",
    "Google Cloud Documentation",
    "AWS Document",
    "Microsoft Learn"
  ],
  "textContent": "Hmm… While commercial OCR services may include such features, standalone OCR models are often not very good at properly interpreting multi-page data. This is because, in most cases, the models are primarily trained on pairs of a single page and the information to be extracted…\n\nThe most straightforward workaround is to split the document into individual pages before feeding them to the OCR model:\n\n* * *\n\nYour old approach broke for a **structural** reason, not just because the OCR was free.\n\nYou were effectively doing:\n\n**500+ invoices in one PDF → OCR everything → flatten to one text stream → run regex like `invoice number \\d{4,6}`**\n\nThat is brittle because PDF/OCR extraction often does **not** preserve normal reading order. PyMuPDF’s docs say plain PDF text extraction may come out “not in usual reading order,” with unexpected line breaks, and recommend using blocks or words with position data instead. (PyMuPDF)\n\nSo the main fix is not “use a better regex.” The main fix is:\n\n**split first, extract locally, keep coordinates, then validate the totals.** Google’s Custom Splitter exists specifically to split packed PDFs into logical documents before extraction, and Google notes that bad splits are especially damaging because one split error causes downstream extraction errors. (Google Cloud Documentation)\n\n## What to do with shipping, freight, and discounts\n\nTreat them as **separate normalized fields** in your schema. 
Do not fold them into one generic “total adjustment.”\n\nA practical invoice schema for your case is:\n\n\n    {\n      \"invoice_id\": \"123456\",\n      \"vendor_name\": \"ACME Supplies Ltd\",\n      \"invoice_date\": \"2026-03-15\",\n      \"currency\": \"USD\",\n\n      \"subtotal\": 1000.00,\n      \"line_item_discount_total\": 20.00,\n      \"invoice_level_discount\": 10.00,\n      \"shipping_charge\": 15.00,\n      \"freight_charge\": 40.00,\n      \"handling_charge\": 0.00,\n      \"tax_total\": 102.50,\n      \"invoice_total\": 1127.50,\n      \"amount_due\": 1127.50\n    }\n\n\nAnd also store the **raw label text** that produced each field:\n\n  * `raw_label = \"Shipping\"`\n  * `raw_label = \"Shipping & Handling\"`\n  * `raw_label = \"Freight\"`\n  * `raw_label = \"Discount\"`\n\n\n\nThat matters because standard parsers do not always match your accounting distinctions exactly. AWS Textract explicitly standardizes `DISCOUNT` and `SHIPPING_HANDLING_CHARGE`. Azure’s invoice model extracts invoice fields and line items into structured JSON. Microsoft Dynamics’ invoice entity explicitly models `FreightAmount`, `TotalDiscountAmount`, `TotalLineItemAmount`, `TotalAmountLessFreight`, and `TotalTax`, which is close to the accounting structure you need. (AWS Document)\n\n## The formula to validate\n\nUse arithmetic validation as a hard gate.\n\nA practical rule is:\n\n\n    invoice_total\n    ≈ subtotal\n    - line_item_discount_total\n    - invoice_level_discount\n    + shipping_charge\n    + freight_charge\n    + handling_charge\n    + tax_total\n    + other_surcharges\n\n\nAnd if the invoice has prior balance or prior credits:\n\n\n    amount_due\n    ≈ invoice_total\n    + previous_unpaid_balance\n    - credits_or_payments\n\n\nThis is not cosmetic cleanup. It is your error detector. 
If the parser confuses freight with a line item, or misses a discount, this check will usually fail.\n\n## Why your invoice number was missed\n\nYour regex expected the text to appear like this:\n\n\n    Invoice Number 123456\n\n\nBut OCR/PDF extraction often returns something more like:\n\n\n    Invoice\n    Date\n    123456\n    Number\n\n\nor mixes it with neighboring text from another block. PyMuPDF’s docs describe exactly this kind of issue and recommend using block and word extraction with coordinates to rebuild reading order or search local rectangles instead of relying on one global text stream. (PyMuPDF)\n\nSo instead of searching the full document with:\n\n\n    invoice number \\d{4,6}\n\n\ndo this:\n\n  1. find the **header region**\n  2. find labels such as `Invoice No`, `Invoice Number`, `Invoice #`\n  3. collect candidate values **near those labels**\n  4. rank them by distance and alignment\n  5. then validate the winner with `^\\d{4,6}$`\n\n\n\nThat changes regex from a discovery method into a validator. That is much more reliable.\n\n## The concrete pipeline I would use\n\n### 1. Split the packed PDF into individual invoices\n\nThis is the first change.\n\nStart with page-level signals:\n\n  * `Invoice` near the top\n  * an invoice-number/date block near the header\n  * totals block near the bottom\n  * repeated vendor header/logo\n  * continuation pages with line-item tables but no new invoice header\n\n\n\nGoogle’s Custom Splitter is built around exactly this use case: composite files containing multiple logical documents that then get routed to the appropriate extractor. (Google Cloud Documentation)\n\n### 2. Use native PDF text before OCR when possible\n\nIf a page is born-digital, extract words and blocks directly from the PDF first. PyMuPDF recommends block and word extraction because plain text order may be wrong, and `Page.get_text(\"blocks\")` / `Page.get_text(\"words\")` preserve useful position information. (PyMuPDF)\n\n### 3. 
Use document OCR only for scanned pages\n\nFor scanned pages or images, use invoice/document AI OCR rather than generic OCR-only tooling. Azure’s invoice model is built to handle phone captures, scanned documents, and digital PDFs, and returns recognized text, tables, and invoice-specific fields plus line items. AWS Textract’s invoice/receipt path similarly outputs structured summary fields and line items instead of one text blob. (Microsoft Learn)\n\n### 4. Keep coordinates in your intermediate data\n\nFor each word, keep:\n\n  * page number\n  * text\n  * bounding box\n  * line ID\n  * block ID\n  * confidence\n  * source type: native PDF or OCR\n\n\n\nThis is what lets you ask useful questions like “what is near the invoice-number label?” instead of “does the whole OCR blob contain the pattern?”\n\n### 5. Zone the page before extracting fields\n\nSplit each invoice into approximate regions:\n\n  * header\n  * vendor/bill-to area\n  * line-item area\n  * totals area\n  * footer/remittance area\n\n\n\nThen only search:\n\n  * invoice number and date in the **header**\n  * shipping/freight/discount/tax/total in the **totals area**\n  * products, qty, price, amount in the **line-item area**\n\n\n\nThis mirrors how invoice parsers expose output: Azure returns text, tables, and invoice-specific fields; AWS separates summary fields and line items. (Microsoft Learn)\n\n### 6. Treat charges as labeled totals lines\n\nInside the totals block, extract a list of labeled amount lines:\n\nRaw label | Internal field\n---|---\nShipping | `shipping_charge`\nShipping & Handling | `shipping_charge` or split later\nFreight | `freight_charge`\nDiscount | `invoice_level_discount`\nRebate | `invoice_level_discount`\n\nBecause your accounting system distinguishes freight from shipping, do **not** collapse them automatically.\n\n### 7. 
Reconstruct line items separately\n\nDo not use header-field logic for line items.\n\nFor line items, use a table or pseudo-table approach:\n\n  * detect numeric columns on the right\n  * group words into rows by vertical overlap\n  * treat left text as description\n  * merge multiline descriptions when there is no new numeric anchor\n\n\n\nThat is where invoice extraction usually becomes hard.\n\n## Best practical options\n\n### Fastest path\n\nBenchmark a purpose-built invoice parser first.\n\nGood starting options are:\n\n  * Google Document AI: **Custom Splitter + Invoice Parser + uptraining/custom fields**. Google explicitly says you can uptrain the Invoice Parser with your own data and add custom fields that are not supported by the pretrained model. That is directly useful for a field like `freight_charge`. (Google Cloud Documentation)\n  * AWS Textract `AnalyzeExpense`: it already standardizes `DISCOUNT` and `SHIPPING_HANDLING_CHARGE`, plus summary fields and line items. (AWS Document)\n  * Azure Document Intelligence invoice model: it handles scanned images, PDFs, and line items in structured JSON. 
(Microsoft Learn)\n\n\n\n### Strong custom path\n\nIf you want to own the stack:\n\n  * split packed PDFs first\n  * use PyMuPDF blocks/words for digital PDFs\n  * use OCR only for scanned pages\n  * keep coordinates\n  * zone header/totals/items separately\n  * extract local candidates near labels\n  * normalize freight, shipping, discount, tax\n  * reconcile the math before posting anything\n\n\n\n## My concrete advice for you\n\nFor your situation, I would do this in order:\n\n### Phase 1\n\nTake 30 to 50 invoices from the packed PDF and manually create a small gold set:\n\n  * correct invoice boundaries\n  * correct invoice number\n  * subtotal\n  * discount\n  * shipping\n  * freight\n  * tax\n  * total\n  * amount due\n\n\n\n### Phase 2\n\nTest two paths:\n\n  * **managed invoice parser with splitting**\n  * **native PDF extraction + local rules** on already-split invoices\n\n\n\nThat will tell you very quickly whether your real bottleneck is:\n\n  * split detection\n  * OCR quality\n  * reading order\n  * totals parsing\n  * line-item grouping\n\n\n\n### Phase 3\n\nLock your schema before tuning models:\n\n  * `subtotal`\n  * `line_item_discount_total`\n  * `invoice_level_discount`\n  * `shipping_charge`\n  * `freight_charge`\n  * `tax_total`\n  * `invoice_total`\n  * `amount_due`\n\n\n\n### Phase 4\n\nAdd reconciliation rules and reject anything that does not balance.\n\n## Bottom line\n\nYour earlier failure does **not** mean invoice extraction is a bad fit.\n\nIt means the earlier workflow was fragile:\n\n  * too many invoices in one PDF\n  * flattened OCR text\n  * regex dependent on OCR order\n\n\n\nA stronger workflow is:\n\n**packed PDF → split into invoices → extract with coordinates → parse header/totals/lines separately → keep freight separate from shipping → keep discounts explicit → validate the math**\n\nThat is the path I would take. (Google Cloud Documentation)",
  "title": "Invoice Data Recognition"
}