Synthetic Australian medical record PDF library (50-doc free sample) - feedback wanted on dataset
Hi all,
I released a synthetic medical document library this week and would love some community input on the format and dataset card before I push it more widely.
The dataset: RootCauseAnalytics/synthetic-australian-medical-documents-sample
Quick summary: 50-document free sample of a 5,000-doc library of synthetic Australian medical PDFs (NSW Health style, PHI-free, CC-BY-NC 4.0). 29 document types in the sample. Every labelled field has bounding-box ground-truth coordinates recorded by the generator at render time, so they’re pixel-exact, not OCR-approximated. Each document also ships with a scanned variant drawn from one of four quality tiers (clean / scanned / poor / fax) for OCR-robustness training.
Why this exists: real Australian medical documents can’t legally be released for training, and existing public clinical-text libraries (MIMIC, etc.) are US-centric and contain plain text only, with no document layout. So I built a deterministic Python pipeline that generates visually realistic NSW Health-style PDFs end-to-end with structured ground truth.
Three specific things I’d love feedback on:
1. Should I add a HF-Datasets loading script? Right now this is a raw-files dataset (PDFs + ground_truth.csv + ground_truth.jsonl + bboxes.jsonl). load_dataset() will auto-detect the CSV/JSONL but won’t surface the PDFs. Would a custom builder script that yields (image_bytes, structured_fields, bboxes) tuples be more useful, or do most people prefer raw files for document-AI work?
2. Bbox format. Coordinates are stored as (x, y, width, height, page) in PDF points, both as a bboxes_json column in the CSV and as a per-doc bboxes.jsonl index. Is this the format you’d want for training LayoutLMv3 / Donut / DocFormer pipelines, or would you prefer normalised coordinates, COCO format, or LayoutLMv3-native token-aligned boxes?
3. Scan-degradation profiles. The four-tier pipeline (clean / scanned / poor / fax) tries to cover the realistic range of medical document quality. Are there degradation profiles I’m missing that would be valuable for OCR-robustness benchmarking?
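To make the bbox question concrete, here is a minimal sketch of what converting the current format to LayoutLMv3-style boxes could look like. LayoutLMv3 expects (x0, y0, x1, y1) corner coordinates normalised to a 0-1000 range. Everything else here is an assumption for illustration, not something taken from the dataset: the function name to_layoutlm_box is hypothetical, the page is assumed to be A4, and the stored origin is assumed to be top-left (pass origin_bottom_left=True if the boxes use the native PDF bottom-left origin instead).

```python
from typing import Tuple

# Assumption: pages are A4 (595.276 x 841.89 PDF points).
A4_WIDTH_PT = 595.276
A4_HEIGHT_PT = 841.89


def to_layoutlm_box(
    x: float,
    y: float,
    width: float,
    height: float,
    page_width: float = A4_WIDTH_PT,
    page_height: float = A4_HEIGHT_PT,
    origin_bottom_left: bool = False,
) -> Tuple[int, int, int, int]:
    """Convert an (x, y, width, height) box in PDF points to
    (x0, y0, x1, y1) integers on a 0-1000 scale with a top-left origin."""
    if origin_bottom_left:
        # Flip the vertical axis: y then measures from the top of the page
        # down to the top edge of the box.
        y = page_height - y - height

    def scale(value: float, extent: float) -> int:
        # Normalise to 0-1000 and clamp so rounding never leaves the range.
        return max(0, min(1000, int(round(1000 * value / extent))))

    return (
        scale(x, page_width),
        scale(y, page_height),
        scale(x + width, page_width),
        scale(y + height, page_height),
    )
```

Because the conversion only needs the page size and the origin convention, shipping raw PDF points plus those two facts in the dataset card may be enough for most people to derive normalised or COCO-style boxes themselves.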
Also open to general feedback on the dataset card itself - anything confusing, missing, or oversold? Happy to dig deeper into any part of the generation pipeline if anyone’s curious.
Thanks!