
Synthetic Australian medical record PDF library (50-doc free sample) - feedback wanted on dataset

Hugging Face Forums [Unofficial] May 7, 2026

Hi all,

I released a synthetic medical document library this week and would love some community input on the format and dataset card before I push it more widely.

The dataset: RootCauseAnalytics/synthetic-australian-medical-documents-sample

Quick summary: a 50-document free sample of a 5,000-doc library of synthetic Australian medical PDFs (NSW Health style, PHI-free, CC-BY-NC 4.0). The sample covers 29 document types. Every labelled field has bounding-box ground-truth coordinates recorded by the generator at render time, so they’re exact rather than OCR-approximated. Each document also ships with a scanned variant drawn from one of four quality tiers (clean / scanned / poor / fax) for OCR-robustness training.

Why this exists: real Australian medical documents can’t legally be released for training, and existing public clinical-text libraries (MIMIC, etc.) are US-centric and text-only, with no document layout. So I built a deterministic Python pipeline that generates visually realistic NSW Health-style PDFs end-to-end with structured ground truth.


Three specific things I’d love feedback on:

  1. Should I add a Hugging Face Datasets loading script? Right now this is a raw-files dataset (PDFs + ground_truth.csv + ground_truth.jsonl + bboxes.jsonl). load_dataset() will auto-detect the CSV/JSONL but won’t surface the PDFs. Would a custom builder script that yields (image_bytes, structured_fields, bboxes) tuples be more useful, or do most people prefer raw files for document-AI work?

  2. Bbox format. Coordinates are stored as (x, y, width, height, page) in PDF points, both as a bboxes_json column in the CSV and as a per-doc bboxes.jsonl index. Is this the format you’d want for training LayoutLMv3 / Donut / DocFormer pipelines, or would you prefer normalised coordinates, COCO format, or LayoutLMv3-native token-aligned boxes?

  3. Scan-degradation profiles. The four-tier pipeline (clean / scanned / poor / fax) tries to cover the realistic range of medical document quality. Are there degradation profiles I’m missing that would be valuable for OCR-robustness benchmarking?
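
To make question 2 concrete, here is a rough sketch of how the stored (x, y, width, height, page) boxes could be turned into the two alternatives mentioned: corner boxes scaled to 0-1000 (the convention LayoutLM-family models expect) and COCO-style pixel boxes. It is illustrative only. The names `Box`, `to_layoutlm_box`, and `to_pixel_xywh` are hypothetical, the A4 page size is an assumption, and the `origin_bottom_left` flag exists because the post does not say whether y is measured from the top or the bottom of the page.

```python
from typing import List, NamedTuple

# Assumption: A4 pages, rounded to whole PDF points. Use the real page size if it differs.
A4_WIDTH_PT = 595.0
A4_HEIGHT_PT = 842.0


class Box(NamedTuple):
    """One stored box: (x, y, width, height) in PDF points, plus the page index."""
    x: float
    y: float
    width: float
    height: float
    page: int


def to_layoutlm_box(
    box: Box,
    page_width: float = A4_WIDTH_PT,
    page_height: float = A4_HEIGHT_PT,
    origin_bottom_left: bool = False,
) -> List[int]:
    """Return [x0, y0, x1, y1] as integers in 0-1000 with a top-left origin."""
    x0, x1 = box.x, box.x + box.width
    if origin_bottom_left:
        # Assumption: (x, y) is the lower-left corner, so flip the vertical axis.
        y0 = page_height - (box.y + box.height)
        y1 = page_height - box.y
    else:
        y0, y1 = box.y, box.y + box.height

    def scale(value: float, extent: float) -> int:
        # Clamp so boxes touching the page edge stay inside the 0-1000 range.
        return max(0, min(1000, round(1000 * value / extent)))

    return [
        scale(x0, page_width),
        scale(y0, page_height),
        scale(x1, page_width),
        scale(y1, page_height),
    ]


def to_pixel_xywh(box: Box, dpi: int = 150) -> List[float]:
    """Return COCO-style [x, y, w, h] in pixels for a page rasterised at `dpi`.

    There are 72 PDF points per inch. This assumes a top-left origin already.
    """
    s = dpi / 72.0
    return [box.x * s, box.y * s, box.width * s, box.height * s]


example = Box(x=59.5, y=84.2, width=119.0, height=42.1, page=1)
print(to_layoutlm_box(example))
print(to_pixel_xywh(example, dpi=144))
```

Since both conversions are a few lines of arithmetic from points, keeping PDF points as the stored format and documenting the origin and page size on the dataset card may be enough. Token-aligned boxes are a different matter, because they depend on the tokenizer.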


Also open to general feedback on the dataset card itself - anything confusing, missing, or oversold? Happy to dig deeper into any part of the generation pipeline if anyone’s curious.

Thanks!
