Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreid4h4old7amvbebtnoicok2m6ja5kupca522to5hvnaxlaus7ooce",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mivnjuxrd2p2"
  },
  "path": "/t/how-to-build-custom-key-value-extraction-similar-to-azure-document-intelligence/175015#post_3",
  "publishedAt": "2026-04-07T03:17:39.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Microsoft Learn",
    "GitHub",
    "Hugging Face",
    "Label Studio",
    "arXiv"
  ],
  "textContent": "One challenge with open-source OCR and related tooling is that it rarely comes in the same form as commercial services, which provide a comprehensive, all-in-one package.\n\nIn many cases, plenty of suitable open-source models and libraries exist for specific tasks, but you still need to build the pipeline yourself, and the question is whether that fits your requirements and budget (including the effort involved).\n\nRelying on a single model would be computationally very expensive (even where feasible, much of that compute would be wasted). I suspect that commercial services also use a pipeline structure internally, but the specifics are, of course, unknown…\n\n* * *\n\nYou can build something very close to Azure Document Intelligence with open-source tools, but the right design is **not** “one model that magically reads every document.” The closest open-source equivalent is a **schema-driven pipeline**: define the fields you care about, annotate examples, train an extractor for that schema, route different document families to different specialists, then normalize and validate the outputs. That is also how Azure frames custom extraction: label the values you want, train on that labeled set, and split or compose models when formats differ a lot. (Microsoft Learn)\n\n## The first decision that matters\n\nYour problem is probably one of these three:\n\n  1. **Fixed-schema extraction**\nYou already know the fields. Example: `invoice_number`, `invoice_date`, `vendor_name`, `total_amount`.\n\n  2. **Generic key-value discovery**\nThe system must find arbitrary keys and match them to arbitrary values, even when the field names were not predefined.\n\n  3. **Document QA**\nYou ask a question per field. Example: “What is the invoice number?”\n\n\n\n\nMost Azure-style custom extraction use cases are really **fixed-schema extraction**. 
If that describes your case, do **not** start with the hardest problem. Public issues around LayoutLM-style models show the common trap: people can get token labels or separate “question” and “answer” regions, but then get stuck turning those outputs into reliable key-value JSON. The relation-extraction step is where many implementations become messy. (GitHub)\n\n## What to build\n\nBuild this stack:\n\n  1. **Document router**\nDetect the document family first: invoice, receipt, claim form, onboarding form, statement, and so on.\n\n  2. **OCR + layout extraction**\nExtract text, boxes, reading order, page numbers, and optionally tables.\n\n  3. **Field extractor**\nTrain a model that predicts only your target fields.\n\n  4. **Post-processing**\nNormalize dates, currency, IDs, addresses, totals, and line items.\n\n  5. **Human review**\nSend low-confidence fields to a reviewer.\n\n\n\n\nThis is the closest open-source equivalent to Azure’s custom template and custom neural models combined with its composed-model workflow. Azure explicitly recommends segmenting divergent templates and composing models because mixing very different formats can reduce accuracy. (Microsoft Learn)\n\n## Which model family to choose\n\n### Best default: OCR + layout-aware models\n\nThis is the safest starting point for most business documents.\n\n**LayoutLMv3** is the most practical default. It is designed for Document AI and combines text, layout, and image information. In practice, it is widely used for token-classification-style extraction on forms, receipts, and invoices. (Hugging Face)\n\n**BROS** is especially relevant if key-to-value linking matters. Hugging Face exposes both an entity extraction head and an entity linking head for BROS, which is unusual and directly useful for key information extraction. (Hugging Face)\n\n**LayoutXLM** is the multilingual option. Its docs highlight the XFUN benchmark, which includes manually labeled key-value pairs in seven languages. 
If your documents are multilingual, LayoutXLM is a strong candidate. (Hugging Face)\n\n**LiLT** is another multilingual option. It is designed for structured document understanding across languages by combining layout information with a language-specific text encoder. (Hugging Face)\n\n### OCR-free option: Donut\n\n**Donut** is an OCR-free document understanding model. Instead of running OCR first, it reads document images directly and can be fine-tuned to emit structured output such as JSON. This is attractive when OCR quality is poor or when you want direct image-to-structure prediction. (Hugging Face)\n\n### Alternative formulation: Document QA\n\nHugging Face has a dedicated **Document Question Answering** task. This lets you treat each field as a question:\n\n  * What is the invoice number?\n  * What is the due date?\n  * What is the total amount?\n\n\n\nThis is often the fastest route when the number of required fields is small or moderate and the layouts vary a lot. (Hugging Face)\n\n## What I would recommend for you\n\nFor an Azure-like system, start with **OCR + LayoutLMv3**, not Donut, not generic key-value discovery, and not a research-heavy relation-extraction pipeline.\n\nWhy:\n\n  * It matches Azure’s schema-first workflow well. (Microsoft Learn)\n  * It is easier to debug because you can inspect OCR text, bounding boxes, and bad spans. This is a practical inference supported by the structure of OCR-first tooling and the public issues around “how do I turn this into JSON?” (GitHub)\n  * It lets you label the **target values directly**, which is simpler than solving generic key/value linking. 
(GitHub)\n\n\n\nA good mental model is:\n\n  * **LayoutLMv3** for direct field extraction\n  * **BROS** if linking becomes the main bottleneck\n  * **LayoutXLM or LiLT** if multilingual support is important\n  * **Donut** if OCR is the main failure source\n  * **Document QA** if you only need a limited number of fields and want flexible schema growth (Hugging Face)\n\n\n\n## The OCR layer\n\nDo not underestimate OCR quality. If OCR is weak, the extractor will look weak.\n\nTwo good open-source OCR front ends are:\n\n  * **Surya**: OCR in 90+ languages, line-level text detection, layout analysis, reading order detection, and table recognition. This makes it a strong front end for document pipelines. (GitHub)\n  * **docTR**: end-to-end OCR with a two-stage detection + recognition pipeline. It is simpler and focused. (GitHub)\n\n\n\nMy recommendation:\n\n  * Choose **Surya** if your documents are varied, multilingual, multipage, or table-heavy.\n  * Choose **docTR** if you want a lighter OCR component and will build the rest yourself. (GitHub)\n\n\n\n## How to annotate your data\n\nFor your case, I would **not** start by labeling generic `KEY` and `VALUE` plus relations.\n\nI would label the **target values directly**. Example:\n\n  * `INVOICE_NUMBER`\n  * `INVOICE_DATE`\n  * `VENDOR_NAME`\n  * `CUSTOMER_NAME`\n  * `SUBTOTAL`\n  * `TAX`\n  * `TOTAL_AMOUNT`\n\n\n\nWhy this is better:\n\n  * Your downstream system wants those exact fields.\n  * You avoid a second pairing problem.\n  * You avoid the public pain point that LayoutLM-style outputs do not automatically become a final dictionary. (GitHub)\n\n\n\nFor annotation tooling, **Label Studio** is a good fit. Its PDF OCR template supports multi-page PDFs, normalized coordinates, rotation, page index, and editable OCR text per region. If you later need explicit relation labels, Label Studio also supports relation-style annotation patterns. (Label Studio)\n\n## Recommended training workflow\n\n### 1. 
Define the schema\n\nStart with a small schema. Example:\n\n\n    {\n      \"invoice_number\": null,\n      \"invoice_date\": null,\n      \"vendor_name\": null,\n      \"subtotal\": null,\n      \"tax\": null,\n      \"total_amount\": null\n    }\n\n\nDo not try to “extract everything” first. Azure also starts from labeled target values, not from open-ended document understanding. (Microsoft Learn)\n\n### 2. Split documents into families\n\nDo not train one model on invoices, receipts, bank statements, and forms all at once unless they are visually very similar. Azure explicitly recommends splitting different formats and composing models when needed. (Microsoft Learn)\n\n### 3. Run OCR and layout extraction\n\nUse Surya or docTR to produce:\n\n  * words or lines\n  * bounding boxes\n  * page index\n  * reading order\n  * table structure if needed (GitHub)\n\n\n\n### 4. Convert labels to model format\n\nFor LayoutLMv3-style training, you will align OCR words and boxes to tokenized inputs. This is one place where many beginners fail. There are public forum posts showing word labels no longer matching token labels after subword splitting. (GitHub)\n\n### 5. Train a direct field extractor\n\nUse **token classification** first. That means the model predicts field labels over tokens or words. This is much simpler than generic relation extraction. LayoutLMv3 is well suited to this. (Hugging Face)\n\n### 6. Add post-processing\n\nThis is not optional. Add rules for:\n\n  * date parsing\n  * currency normalization\n  * numeric cleanup\n  * ID regexes\n  * duplicate resolution\n  * confidence thresholds\n\n\n\nWithout this layer, even a good model will feel brittle. This is an engineering recommendation, but it follows directly from the gap between raw model spans and final business-ready fields seen in public issues. (GitHub)\n\n### 7. Evaluate at the field level\n\nDo not rely only on token F1. 
Track:\n\n  * exact match by field\n  * normalized exact match\n  * document-level pass rate\n  * review rate\n  * optional pair-level metrics if you later add linking\n\n\n\nThis is especially important because relation quality and grouped extraction quality matter more than raw token labeling in production-style systems. (arXiv)\n\n## When to use relation extraction\n\nOnly add a second relation-extraction stage if your documents truly require it.\n\nUse it when:\n\n  * the same field appears multiple times in local groups\n  * field names vary widely\n  * there are many repeated key-value blocks\n  * you need generic key-value discovery rather than fixed-schema extraction\n\n\n\nPaddleOCR’s KIE documentation is a good example of this architecture: first run **SER** to detect the key/value spans, then run **RE** to match keys and values. (GitHub)\n\nBut relation extraction also adds complexity. Public issues and discussions show that this is exactly where many users get stuck. (GitHub)\n\n## When Donut is a better fit\n\nUse Donut if these are true:\n\n  * OCR errors dominate your current failures\n  * document layouts are visually consistent\n  * you want direct structured output\n  * you are willing to validate generated JSON carefully\n\n\n\nDonut is powerful, but for a first Azure-like replacement, I would still start with an OCR-first system because it is easier to inspect and maintain. (Hugging Face)\n\n## Public code and tutorials worth using\n\nThese are practical starting points:\n\n  * **Hugging Face model docs** for LayoutLMv3, BROS, Donut, LayoutXLM, LiLT, and Document QA. (Hugging Face)\n  * **Phil Schmid’s `document-ai-transformers`** repo. It includes tutorials for fine-tuning Donut, LayoutLM, and LiLT, plus data-processing examples. (GitHub)\n  * **Niels Rogge’s tutorials and issues**, which are useful because they expose the actual implementation pain points, not just the happy path. 
(GitHub)\n  * **PaddleOCR KIE** docs if you later need a SER + RE pipeline. (GitHub)\n\n\n\n## Practical recommendation stack\n\nIf I had to choose one first version for you, I would build this:\n\n  * **OCR**: Surya\n  * **Annotation**: Label Studio PDF OCR template\n  * **Extractor**: LayoutLMv3 token classification\n  * **Routing**: a document-family classifier that sends each document to its family-specific extractor\n  * **Post-processing**: rules + confidence thresholds\n  * **Fallback**: human review for low-confidence fields (GitHub)\n\n\n\nThen I would evolve it like this:\n\n  * If multilingual: test **LayoutXLM** or **LiLT**. (Hugging Face)\n  * If linking becomes the main issue: test **BROS** or a PaddleOCR SER+RE pipeline. (Hugging Face)\n  * If OCR is the main issue: test **Donut**. (Hugging Face)\n  * If the field set is small and changes often: add a **Document QA** baseline in parallel. (Hugging Face)\n\n\n\n## Bottom line\n\nThe cleanest answer is:\n\n  * **Do not start with generic key-value discovery**\n  * **Do not start with one universal model**\n  * **Do start with a schema-driven OCR-first pipeline**\n  * **Do label target fields directly**\n  * **Do split document families early**\n  * **Do add post-processing early**\n\n\n\nThat is the closest open-source equivalent of Azure Document Intelligence custom extraction, and it is the path with the best balance of accuracy, debuggability, and maintainability. (Microsoft Learn)",
  "title": "How to build custom key-value extraction (similar to Azure Document Intelligence)?"
}