Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreidsbsglb7bruae25rkde4ca4xuwuvjed2235ocrxlmsttlrrmdtiu",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mign75k4ysu2"
  },
  "path": "/t/invoice-data-recognition/174564#post_6",
  "publishedAt": "2026-04-01T10:57:29.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Google Cloud Documentation",
    "PyMuPDF",
    "GitHub",
    "OCRmyPDF",
    "Google Cloud"
  ],
  "textContent": "> I know some Python but have not done AI or OCR with Python before.\n\nOh. PyTorch, Transformers, and other libraries handle GPU-related tasks and the acceleration of bottleneck processes, effectively wrapping them up for us.\n\nAs a result, there isn’t much difference in how you actually use standard Python functions (methods) versus functions designed for AI models. The only real precaution is to ensure the hardware isn’t busy when calling AI functions; other than that, there aren’t many other things to watch out for. I think it’s easier than functions related to disk I/O…\n\nHowever, generally speaking, even the lightest general-purpose AI models available on HF are somewhat heavier than plain code, so it might be best to create a prototype without AI first. The fastest approach is to avoid using AI unless absolutely necessary.\n\n* * *\n\nFor your **first learning pass** , starting with **text blocks or paragraph-like chunks** is a good idea. But for the **shipper summary PDF you just described** , I would change one word:\n\nDo **not** think in terms of paragraphs.\nThink in terms of **repeating summary blocks on a page**.\n\nThat is the key difference.\n\nIf one page contains multiple shipment summaries, then the first problem is not “extract all text.” The first problem is **segment the page into summary-sized regions** , then extract each region separately. A page-level splitter alone will not solve that, because Google’s Custom Splitter is designed to identify logical documents in composite files and return page-level document boundaries; if multiple summaries live on the same page, you still need an **intra-page segmentation** step. 
(Google Cloud Documentation)\n\n## What I would do first\n\nFor your first implementation, I would **not start with AI**.\n\nI would start with a simple Python workflow using **native PDF text extraction**, because if these shipper PDFs are machine-generated, that is usually easier and more reliable than OCR. PyMuPDF can extract text as **blocks** and **words**, and its docs explicitly note that plain text may not come out in natural reading order, while block/word extraction and sorting help recover usable structure. pdfplumber is also built for detailed PDF inspection and says it works best on **machine-generated PDFs**. (PyMuPDF)\n\nSo the beginner-friendly path is:\n\n  1. open one PDF page\n  2. extract **blocks** or **words with coordinates**\n  3. detect the repeated shipment-summary regions on that page\n  4. extract text **inside each region**\n  5. parse one region at a time\n\n\n\nThat is much easier than OCR-first AI work, and it teaches the right workflow. (PyMuPDF)\n\n## Why your previous regex workflow failed\n\nYour regex was probably not the main problem.\n\nThe main problem was that OCR or plain PDF extraction often returns text in an order that is not the order your regex expects. PyMuPDF’s docs say the output of plain text extraction may not match natural reading order, and they provide `sort=True` plus block/word extraction specifically to help with this. 
(PyMuPDF)\n\nSo instead of searching a giant text blob for:\n\n\n    invoice number \\d{4,6}\n\n\nyou want to search **inside one detected summary region**, and only then look for local label/value pairs.\n\nThat is a very different workflow.\n\n## The right mental model for your shipper summary PDF\n\nBecause there is **one charge per shipment** and **multiple summaries per page**, your first real task is probably this:\n\n**page → repeated summary boxes/cards → one structured record per box**\n\nNot:\n\n**page → paragraphs**\n\nThat matters because your data sounds closer to a **repeating form layout** than to a narrative document.\n\nSo I would define one shipment-summary record like this:\n\n  * shipper account\n  * shipment date\n  * tracking number or shipment reference\n  * invoice number or summary number\n  * base charge\n  * shipping charge\n  * freight charge\n  * discount\n  * tax\n  * total\n\n\n\nThen repeat that extraction for every summary block on the page.\n\n## The easiest MVP\n\nI would build the MVP in four steps.\n\n### Step 1: decide whether you even need OCR\n\nTry native PDF extraction first.\n\nUse **PyMuPDF** or **pdfplumber** on a few sample pages and inspect whether the text comes out clean enough. pdfplumber explicitly says it works best on machine-generated PDFs, and PyMuPDF exposes blocks, words, and rectangles you can search within. (GitHub)\n\nIf that works, you just saved yourself a lot of complexity.\n\nOnly add OCR later for scanned files or mixed-quality inputs. OCRmyPDF is a good fallback because it adds a searchable text layer to scanned PDFs and is designed to tolerate files that mix scanned and born-digital content. 
(OCRmyPDF)\n\n### Step 2: inspect one page visually\n\nUse block extraction and plot the blocks or print their bounding boxes.\n\nYou want to answer:\n\n  * do the shipment summaries appear as repeated vertical blocks?\n  * are the invoice number and total in stable positions?\n  * are the label/value pairs close together?\n  * do all summaries have roughly the same width and height?\n\n\n\nIf yes, you can often segment the page with very simple geometry rules.\n\n### Step 3: segment one page into summary regions\n\nStart with rules, not AI.\n\nExamples:\n\n  * cluster words/blocks by vertical gaps\n  * detect repeated top labels like “Invoice,” “Shipment,” or “Tracking”\n  * use horizontal rules or whitespace bands if the PDF has them\n  * find repeated left edges and repeated heights\n\n\n\nBecause multiple summaries are on one page, this step is probably more important than OCR quality.\n\n### Step 4: parse one summary region locally\n\nOnce you isolate one region, do local extraction:\n\n  * find the invoice-number label inside that region\n  * look nearby for 4–6 digit candidates\n  * validate the winner with regex\n  * repeat for total, discount, freight, shipping\n\n\n\nThat local approach is much more robust than global regex over the whole document.\n\n## My recommendation about “paragraph extraction”\n\nFor a first exercise, yes, extracting paragraph-like text chunks is fine because it teaches:\n\n  * how to open a PDF in Python\n  * how to inspect blocks and words\n  * how to handle coordinates\n  * how to build a parser incrementally\n\n\n\nBut for your real shipper-summary use case, I would upgrade that idea to:\n\n**extract repeated blocks, not paragraphs**\n\nThat is the version that matches your document structure.\n\n## A practical beginner roadmap\n\n### Phase 1: no AI, no OCR\n\nUse **PyMuPDF**.\n\nGoal:\n\n  * extract blocks and words from one page\n  * print their coordinates\n  * manually identify where one shipment summary starts and 
ends\n\n\n\nPyMuPDF’s text recipes include block extraction, word extraction, extraction inside rectangles, and sorted text output. (PyMuPDF)\n\n### Phase 2: rule-based region segmentation\n\nWrite a small function that groups blocks into summary regions.\n\nGoal:\n\n  * get from “one page” to “N shipment summaries on that page”\n\n\n\n### Phase 3: field extraction inside each region\n\nUse local rules:\n\n  * regex only inside the region\n  * proximity to labels\n  * fallback rules if a label is missing\n\n\n\n### Phase 4: arithmetic validation\n\nBecause shipping, freight, discounts, and total all matter, add a check like:\n\n\n    total ≈ base_charge - discounts + shipping + freight + tax\n\n\nThe exact formula depends on the shipper’s layout, but the principle is stable: do not trust extracted numbers until they balance.\n\n## Where AI helps later\n\nAI becomes useful after you understand the document shape.\n\nFor your case, I would add AI later for one of three reasons:\n\n  * some PDFs are scans, so you need OCR\n  * the summary-region segmentation is inconsistent\n  * local rules for labels and fields become too brittle across shippers\n\n\n\nAt that point, current document models on Hugging Face such as **PaddleOCR-VL-1.5**, **GLM-OCR**, or layout companions like **PP-DocLayoutV3** become relevant, but I would not start there if your immediate goal is to learn the workflow and get a first success in Python. Those tools are better once you already know what one correct extracted record should look like. 
(Google Cloud)\n\n## What I would choose for you right now\n\nI would start with this exact stack:\n\n  * **PyMuPDF** for page, block, and word extraction\n  * **pdfplumber** only as a visual debugging helper when needed\n  * **no OCR** unless the sample PDFs turn out to be image-only\n  * **rule-based region segmentation**\n  * **local regex + label proximity** for field extraction\n  * **math validation** for shipping, freight, discounts, and totals\n\n\n\nThat is the simplest path that still matches your real document structure. PyMuPDF gives you the coordinates and block-level tools you need, and pdfplumber is helpful when you want to inspect how the PDF is laid out. (PyMuPDF)\n\n## The one change I would make to your plan\n\nYour instinct to start simpler is correct.\n\nI would just change the target from:\n\n**paragraph extraction**\n\nto:\n\n**summary-block extraction**\n\nThat single change aligns the project with the actual structure of your shipper PDF and gives you a much better chance of getting an early win.\n\n* * *\n\nStart with **PyMuPDF only**. It can extract **blocks** and **words** with coordinates, and `sort=True` can reorder output roughly from top-left to bottom-right. That is a much better first step than OCR for machine-generated PDFs, especially when one page contains multiple repeated shipment summaries. (PyMuPDF)\n\n\n    # deps:\n    #   pip install pymupdf\n    #\n    # notes:\n    # - No AI model. No OCR. 
CPU-safe.\n    # - Replace SAMPLE_PDF_URL later with your own PDF path or URL.\n    # - This is a first workflow script: download/open PDF -> extract text blocks ->\n    #   group nearby blocks into rough \"summary regions\" -> print/save results.\n\n    import json\n    import os\n    import urllib.request\n    import fitz  # PyMuPDF\n\n    # Public sample PDF for demo.\n    # Replace with your own local PDF path later, for example:\n    # PDF_SOURCE = \"my_shipper_summary.pdf\"\n    PDF_SOURCE = \"https://www.w3.org/WAI/WCAG20/Techniques/working-examples/PDF20/table.pdf\"\n\n    OUT_DIR = \"demo_pdf_blocks\"\n    PAGE_INDEX = 0          # first page only for the first experiment\n    GAP_THRESHOLD = 18.0    # larger => fewer/larger grouped regions\n\n    os.makedirs(OUT_DIR, exist_ok=True)\n\n    def ensure_local_pdf(src: str) -> str:\n        \"\"\"Download the PDF if src is a URL. Otherwise return the local path.\"\"\"\n        if src.startswith(\"http://\") or src.startswith(\"https://\"):\n            local_path = os.path.join(OUT_DIR, \"sample.pdf\")\n            if not os.path.exists(local_path):\n                print(f\"Downloading sample PDF to: {local_path}\")\n                urllib.request.urlretrieve(src, local_path)\n            return local_path\n        return src\n\n    def clean_text(s: str) -> str:\n        \"\"\"Normalize block text for easier printing.\"\"\"\n        return \" \".join(s.replace(\"\\x00\", \" \").split())\n\n    def group_blocks_into_regions(blocks, gap_threshold=18.0):\n        \"\"\"\n        Very simple region grouping:\n        - sort blocks top-to-bottom, then left-to-right\n        - start a new region when the vertical gap is large\n        This is only a first heuristic for repeated summary blocks.\n        \"\"\"\n        regions = []\n        current = []\n\n        for block in blocks:\n            x0, y0, x1, y1, text, block_no, block_type = block\n            if block_type != 0:  # keep text blocks only\n          
      continue\n            text = clean_text(text)\n            if not text:\n                continue\n\n            item = {\n                \"bbox\": [round(x0, 1), round(y0, 1), round(x1, 1), round(y1, 1)],\n                \"text\": text,\n                \"block_no\": int(block_no),\n            }\n\n            if not current:\n                current.append(item)\n                continue\n\n            prev_y1 = current[-1][\"bbox\"][3]\n            current_y0 = item[\"bbox\"][1]\n            vertical_gap = current_y0 - prev_y1\n\n            if vertical_gap > gap_threshold:\n                regions.append(current)\n                current = [item]\n            else:\n                current.append(item)\n\n        if current:\n            regions.append(current)\n\n        # Add combined region bbox + joined text\n        packed = []\n        for idx, region in enumerate(regions):\n            xs0 = [b[\"bbox\"][0] for b in region]\n            ys0 = [b[\"bbox\"][1] for b in region]\n            xs1 = [b[\"bbox\"][2] for b in region]\n            ys1 = [b[\"bbox\"][3] for b in region]\n            packed.append({\n                \"region_id\": idx,\n                \"bbox\": [min(xs0), min(ys0), max(xs1), max(ys1)],\n                \"text\": \"\\n\".join(b[\"text\"] for b in region),\n                \"blocks\": region,\n            })\n        return packed\n\n    # 1) Load PDF\n    pdf_path = ensure_local_pdf(PDF_SOURCE)\n    doc = fitz.open(pdf_path)\n    page = doc[PAGE_INDEX]\n\n    # 2) Extract blocks with sort=True\n    # PyMuPDF can also do get_text(\"words\", sort=True) later if you need finer control.\n    blocks = page.get_text(\"blocks\", sort=True)\n\n    # 3) Group blocks into rough page regions\n    regions = group_blocks_into_regions(blocks, gap_threshold=GAP_THRESHOLD)\n\n    # 4) Save raw outputs\n    raw_blocks_path = os.path.join(OUT_DIR, \"page_blocks.json\")\n    regions_path = os.path.join(OUT_DIR, \"page_regions.json\")\n    
page_text_path = os.path.join(OUT_DIR, \"page_text.txt\")\n\n    with open(raw_blocks_path, \"w\", encoding=\"utf-8\") as f:\n        json.dump(blocks, f, indent=2, ensure_ascii=False)\n\n    with open(regions_path, \"w\", encoding=\"utf-8\") as f:\n        json.dump(regions, f, indent=2, ensure_ascii=False)\n\n    with open(page_text_path, \"w\", encoding=\"utf-8\") as f:\n        for region in regions:\n            f.write(f\"\\n=== REGION {region['region_id']} ===\\n\")\n            f.write(region[\"text\"])\n            f.write(\"\\n\")\n\n    # 5) Print a compact summary\n    print(f\"\\nPDF: {pdf_path}\")\n    print(f\"Pages: {doc.page_count}\")\n    print(f\"Using page index: {PAGE_INDEX}\")\n    print(f\"Text blocks found: {sum(1 for b in blocks if b[6] == 0)}\")\n    print(f\"Rough regions found: {len(regions)}\")\n\n    for region in regions:\n        x0, y0, x1, y1 = region[\"bbox\"]\n        preview = region[\"text\"][:200].replace(\"\\n\", \" | \")\n        print(f\"\\nREGION {region['region_id']}  bbox=({x0}, {y0}, {x1}, {y1})\")\n        print(f\"Preview: {preview}\")\n\n    print(\"\\nSaved:\")\n    print(\" -\", raw_blocks_path)\n    print(\" -\", regions_path)\n    print(\" -\", page_text_path)\n\n    doc.close()\n\n\nFor your shipper-summary PDF, the next step after this is to replace the simple vertical-gap grouping with **repeating summary-block detection** , then extract fields like invoice number, shipping, freight, discount, and total **inside each region only**. That avoids the reading-order problem that broke the whole-document regex approach. (PyMuPDF)",
  "title": "Invoice Data Recognition"
}