{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreig2iwmgcj3kfijtcgevmy6ncwk7tgc4yjwrsbjboosk3eidtmosxu",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mgwprzbt57b2"
},
"path": "/t/built-a-phi-de-identification-benchmark-that-tests-streaming-data-not-just-single-documents/174235#post_1",
"publishedAt": "2026-03-13T03:55:26.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"GitHub - azithteja91/phi-exposure-guard: Adaptive PHI de-identification for streaming multimodal data: exposure-aware, stateful and audit ready · GitHub"
],
"textContent": "Been working on this for a while and just pushed a big update. Wanted to share it here because I think it’s actually useful and I’d love to hear if people run into issues with it.\n\nThe core problem I kept running into: every PHI benchmark I found treats each clinical document as independent. You mask it, you score it. But that’s not how re-identification actually works. The threat is cumulative. A name in a note, the same name in an ASR transcript 10 minutes later, a matching date in imaging metadata after that. Each one looks clean in isolation. Together they’re a re-identification risk.\n\nSo I built a benchmark around that.\n\n**What it does**\n\nThe dataset simulates a multimodal clinical stream across text, ASR, image, waveform, and audio proxy events. An RL controller (PPO) watches cumulative re-identification risk build across events and selects a masking policy from five tiers: raw, weak, synthetic, pseudo, redact. The policy escalates as risk crosses configurable thresholds.\n\nThe result on the bursty workload: Privacy@HighRisk 0.9907 and Utility@LowRisk 0.8466, at the same time. No static policy gets both. Always-Redact gets the privacy score but utility goes to zero. Always-Pseudo gets close on privacy but utility drops to 0.44. The re-identifier AUROC drops by 0.9167 across 10 runs with std 0.0.\n\n**What’s in this dataset**\n\nThree Hugging Face configs:\n\n`default` – 34 live adaptive masking events, each with full risk breakdown, policy decision, consent status, and CRDT risk\n\n`signed` – same events with ECDSA signatures and a Merkle chain, FHIR-exportable\n\n`crossmodal` – 260 rows across 5 scenarios testing cross-modal PHI linkage. 
Scenario E is an adversarial attacker that stays below individual risk thresholds while accumulating cross-modal links.\n\nSupplementary files:\n\n * leakage breakdown by entity type (MRN, date, name, facility)\n\n * full risk component trace per event (units factor, recency, link bonus)\n\n * threshold sensitivity sweep across 8 values and 3 workloads\n\n * baseline comparison: 6 policies across 3 workloads, every number needed to reproduce the Pareto frontier\n\n**What’s new in this version**\n\nPreviously the dataset had placeholder charts and incomplete supplementary files. This version regenerates everything from the actual run: the risk trace, leakage breakdown, and threshold sensitivity sweep, plus all four inline charts in the README, which are now plotted from actual event data.\n\nAlso added the adversarial crossmodal scenario, the FHIR audit trail, and the signed Merkle audit log for anyone working on compliance tooling.\n\n**MIT licensed, no DUA, fully synthetic**\n\ni2b2 and PhysioNet both require data use agreements. That makes sense given they have real patient data. It also means you can’t just clone and run without going through an approval process.\n\nThis dataset has no real patient data. You can load it right now:\n\n from datasets import load_dataset\n\n ds = load_dataset(\"vkatg/streaming-phi-deidentification-benchmark\")\n cm = load_dataset(\"vkatg/streaming-phi-deidentification-benchmark\", \"crossmodal\")\n\nFull code is on GitHub at azithteja91/phi-exposure-guard if you want to run the controller yourself or extend it.\n\nIf you’re building a de-identification system, doing privacy research, working on clinical NLP, or just want a streaming benchmark you can actually use without paperwork, give it a try. Happy to answer questions here or in the repo.",
"title": "built a PHI de-identification benchmark that tests streaming data, not just single documents"
}