Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreie4g6oy3567jl4xv5jhdbdpd4bkc3b3wjjlgmjweh6wnl45mdltiy",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mlaztmz322e2"
  },
  "path": "/t/synthetic-australian-medical-pdf-library-50-doc-free-sample-feedback-wanted-on-the-dataset/175820#post_1",
  "publishedAt": "2026-05-07T10:04:16.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "RootCauseAnalytics/synthetic-australian-medical-documents-sample"
  ],
  "textContent": "Hi all,\n\nI released a synthetic medical document library this week and would love some community input on the format and dataset card before I push it more widely.\n\n**The dataset** : RootCauseAnalytics/synthetic-australian-medical-documents-sample\n\n**Quick summary** : 50-document free sample of a 5,000-doc library of synthetic Australian medical PDFs (NSW Health style, PHI-free, CC-BY-NC 4.0). 29 document types in the sample. Every labelled field has bounding-box ground-truth coordinates recorded by the generator at render time - so they’re pixel-exact, not OCR-approximated. Each document also ships with a scanned variant drawn from one of four quality tiers (clean / scanned / poor / fax) for OCR-robustness training.\n\n**Why this exists** : real Australian medical documents can’t legally be released for training, and existing public clinical-text libraries (MIMIC, etc.) are US-centric and document-format-free. So I built a deterministic Python pipeline that generates visually realistic NSW Health-style PDFs end-to-end with structured ground truth.\n\n* * *\n\n**Three specific things I’d love feedback on:**\n\n  1. **Should I add a HF-Datasets loading script?** Right now this is a raw-files dataset (PDFs + `ground_truth.csv` + `ground_truth.jsonl` + `bboxes.jsonl`). `load_dataset()` will auto-detect the CSV/JSONL but won’t surface the PDFs. Would a custom builder script that yields `(image_bytes, structured_fields, bboxes)` tuples be more useful, or do most people prefer raw files for document-AI work?\n\n  2. **Bbox format**. Coordinates are stored as `(x, y, width, height, page)` in PDF points, both as a `bboxes_json` column in the CSV and as a per-doc `bboxes.jsonl` index. Is this the format you’d want for training LayoutLMv3 / Donut / DocFormer pipelines, or would you prefer normalised coordinates, COCO format, or LayoutLMv3-native token-aligned boxes?\n\n  3. **Scan-degradation profiles**. The four-tier pipeline (clean / scanned / poor / fax) tries to cover the realistic range of medical document quality. Are there degradation profiles I’m missing that would be valuable for OCR-robustness benchmarking?\n\n\n\n\n* * *\n\nAlso open to general feedback on the dataset card itself - anything confusing, missing, or oversold? Happy to dig deeper into any part of the generation pipeline if anyone’s curious.\n\nThanks!",
  "title": "Synthetic Australian medical PDF library (50-doc free sample) - feedback wanted on the dataset"
}