Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiabrpl3uz3xrpxn4kzvi7dalktkefe3k6kh4nbn56yl5bkzba6zgm",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mhsbtexb4pe2"
  },
  "path": "/t/invoice-data-recognition/174564#post_2",
  "publishedAt": "2026-03-24T07:09:18.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "AWS Document",
    "Hugging Face",
    "Google Cloud Documentation",
    "mindee.github.io",
    "Microsoft Learn",
    "GitHub",
    "arXiv",
    "Google Cloud",
    "PyMuPDF",
    "Invoice2data"
  ],
  "textContent": "While there are plenty of good existing OCR models, you shouldn’t expect a single model to work well on its own when dealing with extremely messy invoices. It’s better to use them in combination.\n\nHow heavy the OCR model or other models need to be depends on just how messy the invoices are…\n\n* * *\n\nBuild it as a **document understanding pipeline** , not as a plain NER model.\n\nThat is the main recommendation.\n\nInvoices are visually structured documents. They contain header fields, totals blocks, and line-item tables. That is why the mature invoice systems from AWS, Google, and Azure all treat invoices as a mix of **summary fields** and **line items** , not as one flat sequence-labeling problem. AWS Textract returns `SummaryFields` and `LineItemGroups`. Google’s Invoice Parser extracts both header and line-item fields. Azure’s invoice model also extracts key fields plus line items in structured JSON. (AWS Document)\n\n## Why plain NER is not enough\n\nClassic NER assumes text is mostly a sequence. In invoices, meaning depends heavily on **position** and **grouping** :\n\n  * the same number can be a unit price, tax amount, subtotal, or total\n  * a product description may wrap across multiple lines\n  * values belonging to one row may be far apart in reading order but aligned visually\n  * totals often appear multiple times in different boxes\n\n\n\nThat is why document models such as LayoutLMv3 use both text and image/layout information, and why DocILE evaluates **Key Information Localization and Extraction** separately from **Line Item Recognition**. Line-item recognition exists as a separate task because finding fields is easier than grouping them into the correct item rows. (Hugging Face)\n\n## The right way to think about the problem\n\nYour real goal is not only “tag supplier, products, prices, total.”\n\nYour real goal is:\n\n  1. read the invoice correctly\n  2. recover its structure\n  3. normalize the values\n  4. 
validate the math\n  5. map the result to accounting categories\n\n\n\nSo I would split the system into **two major outputs**:\n\n### Output A. Structured invoice extraction\n\nThis produces:\n\n  * supplier name\n  * supplier tax ID\n  * invoice number\n  * invoice date\n  * due date\n  * currency\n  * subtotal\n  * tax\n  * total\n  * line items\n\n\n\n### Output B. Accounting decision\n\nThis uses the structured output to predict:\n\n  * GL account\n  * expense category\n  * tax code\n  * cost center\n  * approval or exception flags\n\n\n\nThat separation is important. Extraction answers “what is on the invoice.” Accounting classification answers “how finance should code it.” Google’s Document AI flow reflects this distinction too: you can use a pretrained invoice parser and then uptrain it with your own fields and data when the generic parser is not enough. (Google Cloud Documentation)\n\n## The architecture I would recommend\n\n### 1. Ingestion and document triage\n\nFirst decide what kind of file you have:\n\n  * born-digital PDF with selectable text\n  * scanned PDF\n  * photo or image\n  * multi-page mixed document\n\n\n\nFor born-digital PDFs, extract the text layer and coordinates first. For scans or photos, run OCR. A hybrid setup is better than forcing every document through image OCR, because clean PDF text is usually more accurate than OCR. OCR stacks such as docTR are useful here because they return localized word predictions, not just a plain string. (mindee.github.io)\n\n### 2. Layout zoning\n\nBefore extracting fields, segment the page into likely regions:\n\n  * header\n  * addresses\n  * line-item region\n  * totals region\n  * footer\n\n\n\nThis makes the rest of the system much easier. If you can isolate the totals area from the item table, you reduce many false assignments immediately. Azure’s layout and invoice tooling explicitly emphasizes extracting text and layout information from documents, not only OCR text. (Microsoft Learn)\n\n### 3. 
Header-field extraction\n\nFor fields like supplier, invoice number, dates, subtotal, tax, and total, use a **layout-aware extractor**.\n\nA strong open baseline is **LayoutLMv3**. It is built for Document AI and combines text and image signals. This is much better suited to invoices than plain token NER because it can use both wording and spatial placement. (Hugging Face)\n\n### 4. Line-item extraction\n\nThis is the hardest part, and it deserves its **own subsystem**.\n\nUse two modes:\n\n#### Mode 1. True table mode\n\nWhen the invoice has a clear table, use a table detector and structure recognizer. **Table Transformer** is a good open-source building block here, and its official repository is also the home of PubTables-1M and the GriTS metric. (GitHub)\n\n#### Mode 2. Implicit table mode\n\nMany invoices do not have a clean bordered table. They use whitespace alignment. In that case:\n\n  * find right-aligned numeric columns first\n  * cluster text boxes by vertical overlap into candidate rows\n  * treat left text as description\n  * merge rows when the description continues but no new numeric anchors appear\n  * carry row state across pages if the table continues\n\n\n\nThis is where many projects fail. DocILE’s separate line-item task is strong evidence that row grouping is not just post-processing noise. It is a central modeling problem. (arXiv)\n\n### 5. Normalization\n\nConvert raw text into canonical values:\n\n  * dates to ISO format\n  * amounts to decimals\n  * currency to a standard code\n  * supplier names to canonical vendor IDs\n\n\n\nExample:\n\n  * `1.234,56` and `1,234.56` should become the same internal number\n  * `ACME Ltd.` and `ACME LIMITED` should map to the same vendor entity\n\n\n\nThis step is critical for downstream accounting, duplicate detection, and analytics. The commercial invoice parsers all return structured values because raw OCR text is not enough for workflow automation. (Microsoft Learn)\n\n### 6. 
Validation and reconciliation\n\nThis is the most important non-model part of the system.\n\nDo not trust extraction because it “looks right.” Trust it only if it reconciles.\n\nChecks should include:\n\n  * sum(line amounts) ≈ subtotal\n  * subtotal + tax + shipping − discount ≈ total\n  * quantity × unit price ≈ line amount\n  * currency is consistent across the document\n  * page-level totals do not get mixed into line items\n\n\n\nThis is not just cleanup. It is your error detector. KIEval makes the same broad point from an evaluation perspective: industrial document extraction must assess **grouped structured information**, not just isolated entities. (arXiv)\n\n### 7. Accounting classification\n\nOnly after the invoice is structured should you predict accounting labels.\n\nInputs can include:\n\n  * canonical vendor\n  * line-item descriptions\n  * extracted tax rate\n  * amount ranges\n  * vendor history\n  * previous GL mappings for similar items\n\n\n\nStart simple. A gradient boosting model or logistic regression over engineered features can work surprisingly well once the document is already structured. You do not need a giant end-to-end model for the accounting part on day one. That recommendation is my own reasoning, supported by the fact that major invoice systems focus first on structured extraction and then on downstream workflow integration. (Google Cloud)\n\n## Best model choices\n\nYou have three realistic routes.\n\n### Route 1. Managed parser first\n\nUse AWS Textract, Google Document AI, or Azure Document Intelligence as a production baseline.\n\nThis is the fastest way to get a working benchmark because those systems already parse invoices into header fields and line items. Google also supports uptraining its pretrained invoice processor on your own data. (AWS Document)\n\nThis route is best when:\n\n  * you need speed\n  * labeling data is limited\n  * your differentiation is in accounting logic, not OCR research\n\n\n\n### Route 2. 
Modular open-source stack\n\nThis is the route I would recommend if you want control.\n\nA solid stack is:\n\n  * OCR: docTR or equivalent\n  * header extractor: LayoutLMv3\n  * line-item structure: Table Transformer\n  * rules: normalization + validation\n\n\n\nThis combination matches the actual structure of the problem. docTR handles word localization and recognition. LayoutLMv3 handles layout-aware field extraction. Table Transformer handles table structure. (mindee.github.io)\n\n### Route 3. End-to-end document parser\n\nIf you want to benchmark a modern page-to-structured-output model, try an OCR-free or integrated document model such as **Donut**, **PaddleOCR-VL-1.5**, or **GLM-OCR**. Donut is explicitly OCR-free. PaddleOCR-VL-1.5 and GLM-OCR are current document-parsing models on Hugging Face focused on complex document understanding. (Hugging Face)\n\nThis route is attractive for fast prototyping, but I would still keep explicit validation and line-item logic around it. End-to-end models are useful front ends. They should not be the only safety mechanism in an accounting workflow. (Hugging Face)\n\n## What data to label\n\nStart with a small, useful schema. Do not annotate 80 fields first.\n\n### Phase 1 fields\n\n  * supplier_name\n  * invoice_id\n  * invoice_date\n  * due_date\n  * currency\n  * subtotal\n  * tax_amount\n  * total_amount\n\n\n\n### Phase 2 line items\n\n  * description\n  * quantity\n  * unit\n  * unit_price\n  * line_amount\n  * tax_rate\n\n\n\n### Phase 3 accounting labels\n\n  * vendor_id\n  * GL_account\n  * tax_code\n  * cost_center\n\n\n\nFor public benchmarks and prototyping, **DocILE** is the closest fit to your problem because it is built on business documents and includes line-item recognition. **FUNSD** and **CORD** are useful smaller sets for form understanding and receipt-style parsing, but DocILE is the strongest conceptual match for invoices. 
(arXiv)\n\n## How to evaluate it\n\nDo not evaluate only token F1.\n\nUse at least four evaluation layers:\n\n### 1. Field accuracy\n\nExact or normalized match for supplier, invoice number, dates, subtotal, tax, and total. (Microsoft Learn)\n\n### 2. Line-item grouping accuracy\n\nDid the right quantity, price, and amount end up in the same row? This is exactly the kind of structure KIEval argues should be evaluated explicitly. (arXiv)\n\n### 3. Reconciliation pass rate\n\nWhat percentage of invoices pass your arithmetic checks with no human correction? This is one of the best business metrics for this use case. It is a design recommendation, supported by the structured nature of invoice outputs and grouping-sensitive evaluation. (AWS Document)\n\n### 4. Vendor-split testing\n\nHold out vendors or layouts, not only random pages. DocILE’s test design includes zero-shot and few-shot layouts, which reflects the real production risk: overfitting to common templates. (arXiv)\n\n## Practical pitfalls\n\nThe common failure modes are predictable:\n\n  * multi-line descriptions split into fake extra rows\n  * subtotal mistaken for grand total\n  * unit price mistaken for line amount\n  * tables continuing across pages\n  * tax-inclusive vs tax-exclusive pricing\n  * duplicate totals in summary boxes\n  * supplier name variants\n  * OCR noise on phone photos\n\n\n\nThis is why hybrid systems remain strong in practice. Template-based tools such as `invoice2data` still matter because they support line-item and table plugins, multiple regex patterns per field, and custom organizational fields. That is a useful reminder that deterministic rules still have value, especially for high-volume repeat vendors. (GitHub)\n\n## What I would do first\n\nIf I were building this from scratch, I would do it in this order:\n\n  1. **Build a canonical invoice schema**\n  2. **Implement PDF text extraction plus OCR fallback**\n  3. **Add layout zoning**\n  4. 
**Train or fine-tune a header extractor**\n  5. **Build a dedicated line-item subsystem**\n  6. **Add normalization and reconciliation**\n  7. **Only then add the accounting classifier**\n\n\n\nThat sequence gives you a system that is explainable, measurable, and safe enough for financial workflows. It also mirrors how the mature invoice platforms structure the task. (AWS Document)\n\n## Bottom line\n\nA plain NER model is **not** the best framing.\n\nThe better framing is:\n\n**Document OCR/parsing → layout understanding → header extraction → line-item extraction → normalization → validation → accounting classification**\n\nIf you want the most practical custom baseline, I would choose:\n\n  * **docTR** for OCR\n  * **LayoutLMv3** for header fields\n  * **Table Transformer** for line items\n  * **rules plus reconciliation** for acceptance\n  * **a separate classifier** for the accounting chart\n\n\n\nIf you want the fastest benchmark, compare that against one managed parser such as Google, Azure, or AWS. (mindee.github.io)\n\n* * *\n\nUse the matrix below by asking one question first:\n\n**What is the dominant failure mode in my invoices?**\n\nThat is the right selector. Mature invoice parsers already split the job into **key fields** and **line items**, so one baseline rarely fits every invoice population. AWS exposes `SummaryFields` and `LineItemGroups`, Google’s Invoice Parser extracts both header and line-item fields, and Azure’s invoice model does the same. (AWS Document)\n\n## Side-by-side decision matrix\n\nDominant case | Recommended custom baseline | Why this baseline | Labels / data needed | Cheapest first experiment | Most likely failure mode | Source anchors\n---|---|---|---|---|---|---\n**Mostly clean, born-digital PDFs** | **PyMuPDF or pdfplumber → region rules → Table Transformer → reconciliation** | Best when OCR is unnecessary. PyMuPDF can extract word boxes and tables directly. 
pdfplumber is built for detailed PDF geometry, table extraction, and visual debugging, and works best on machine-generated PDFs. | Very little labeled data at first. Often enough to start with regex/rules plus a few manually checked examples. | Run native PDF extraction on 50 invoices. Compare field coverage and line-item recovery before adding any OCR. | Hidden reading-order issues, merged text blocks, whitespace-only tables. | (PyMuPDF)\n**Scans, phone photos, skew, warping** | **PP-DocLayoutV3 → GLM-OCR → Table Transformer → reconciliation** | Good when geometry is the problem, not just text recognition. PP-DocLayoutV3 is designed for non-planar documents and reading order. GLM-OCR is a current multimodal OCR model for complex document understanding. | Small labeled set for validation is enough to start. Stronger gains come from representative distorted samples. | Benchmark 30 hard pages with and without the layout stage. Measure row recovery, not just OCR text quality. | Curved pages, bad lighting, over-segmentation, wrong reading order. | (Hugging Face)\n**Few labels, many repeat vendors** | **invoice2data templates + PDF/OCR backend + vendor normalizers** | Strong when the same vendor layouts repeat. `invoice2data` supports templates, static fields, and plugins for line items and tables. | Very low ML labeling need. You mainly need clean templates and vendor-specific cleanup rules. | Template the top 10 vendors that make up most volume. Track touchless rate before training anything. | Template drift, unseen vendors, multiline descriptions that break template assumptions. | (Invoice2data)\n**Header fields are fine, line items are the blocker** | **PDF text or OCR → Table Transformer-first pipeline → row repair rules** | Best when supplier/date/total extraction is mostly solved but row grouping is not. Table Transformer is explicitly for table detection and structure recognition from PDF images. | Need labeled line items more than labeled headers. 
Focus annotation on row grouping and numeric columns. | Evaluate on 100 invoices using only line-item metrics: row grouping, numeric binding, subtotal reconciliation. | Wrapped descriptions, implicit tables, page breaks, missing cell boundaries. | (Hugging Face)\n**You want one end-to-end trainable parser baseline** | **Donut → schema normalization → reconciliation** | Good as a clean benchmark for “how far can one model go?” Donut is OCR-free and directly maps document images to structured outputs. | Needs paired page→target-schema examples. Best when you can supervise against JSON-like targets. | Fine-tune on a narrow schema first: supplier, invoice ID, date, subtotal, total, and one simple line-item format. | Hallucinated structure, unstable long outputs, weak row grouping on dense tables. | (Hugging Face)\n**You want a current open document-parser baseline** | **PaddleOCR-VL-1.5 → schema normalization → reconciliation** | Good zero-shot benchmark when layouts vary a lot. The current model card positions it as a 0.9B document parser with strong table/text performance and robustness to scanning, skew, warping, screen photography, and illumination. | Little task-specific labeling needed to start. You mainly need a holdout set for honest evaluation. | Run it on a representative vendor mix and compare only business outputs: valid totals, valid rows, review rate. | Great parse text but imperfect field binding, overconfident outputs on rare layouts. | (Hugging Face)\n**You want a fuller parsing system with less assembly work** | **PP-StructureV3 → custom field mapping → reconciliation** | Good when your goal is end-to-end document parsing rather than assembling many separate tools. PP-StructureV3 is presented as a document parsing solution that converts PDFs and document images to Markdown and JSON. | Moderate. You still need business-specific mapping and validation, but less low-level pipeline glue. 
| Use it as a parser front end, then map its structure into your invoice schema and test on 20 messy multi-page invoices. | General parser output that is structurally rich but not yet aligned to accounting fields. | (Hugging Face)\n**You want the cheapest serious sanity-check baseline** | **PyMuPDF/pdfplumber + regex/keywords for headers + column heuristics for lines + reconciliation** | Best for finding out whether ML is even needed yet. If native PDF text and coordinates already solve most of the problem, you learn that before investing in training. | Almost none initially. You need sample invoices and manual error review. | Try on 100 digital PDFs. Count how many pass field extraction and arithmetic checks with no ML. | Fails badly on scans, implicit multi-line rows, and vendor layouts with weak alignment. | (PyMuPDF)\n\n## How to choose fast\n\nIf your invoices are mostly:\n\n  * **digital PDFs** → start with **PyMuPDF/pdfplumber**\n  * **photos or distorted scans** → start with **PP-DocLayoutV3 + GLM-OCR**\n  * **repeat vendors with low label budget** → start with **invoice2data**\n  * **line-item-heavy** → start with a **Table Transformer-first** stack\n  * **mixed layouts and you want a modern open benchmark** → start with **PaddleOCR-VL-1.5**\n  * **one-model benchmark** → try **Donut** as the clean end-to-end baseline (GitHub)\n\n\n\n## Default pick if you are unsure\n\nIf you do not yet know your dominant failure mode, I would test in this order:\n\n  1. **PyMuPDF/pdfplumber baseline** on digital PDFs\n  2. **PaddleOCR-VL-1.5** as the modern open parser benchmark\n  3. **Table Transformer-first** for line items\n  4. **invoice2data** for high-volume repeat vendors\n  5. **Donut** only as the end-to-end control baseline (PyMuPDF)\n\n",
  "title": "Invoice Data Recognition"
}