
Built a PHI de-identification benchmark that tests streaming data, not just single documents

Hugging Face Forums [Unofficial] March 13, 2026

Been working on this for a while and just pushed a big update. Wanted to share it here because I think it’s actually useful and I’d love to hear if people run into issues with it.

The core problem I kept running into: every PHI benchmark I found treats each clinical document as independent. You mask it, you score it. But that’s not how re-identification actually works. The threat is cumulative. A name in a note, the same name in an ASR transcript 10 minutes later, a matching date in imaging metadata after that. Each one looks clean. Together they’re a problem.

So I built a benchmark around that.

What it does

The dataset simulates a multimodal clinical stream across text, ASR, image, waveform, and audio proxy events. An RL controller (PPO) watches cumulative re-identification risk build across events and selects a masking policy from five tiers: raw, weak, synthetic, pseudo, redact. The policy escalates as risk crosses configurable thresholds.
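To make the escalation idea concrete, here's a minimal sketch. It is not the PPO controller: it swaps in a fixed threshold rule, and the threshold values, the additive risk accumulation, and the function names are all assumptions made up for illustration. Only the five tier names come from the benchmark.

```python
from bisect import bisect_right

# Tier names are from the benchmark, ordered weakest to strongest.
TIERS = ["raw", "weak", "synthetic", "pseudo", "redact"]

# Hypothetical thresholds; the real ones are configurable and may differ.
THRESHOLDS = [0.2, 0.4, 0.6, 0.8]


def select_tier(cumulative_risk, thresholds=THRESHOLDS):
    """Pick the masking tier for the current cumulative risk."""
    # bisect_right counts how many thresholds the risk has reached.
    return TIERS[bisect_right(thresholds, cumulative_risk)]


def run_stream(event_risks):
    """Accumulate per-event risk (simple capped sum, an assumption)
    and return the tier chosen after each event."""
    cumulative = 0.0
    decisions = []
    for risk in event_risks:
        cumulative = min(1.0, cumulative + risk)
        decisions.append(select_tier(cumulative))
    return decisions


if __name__ == "__main__":
    # Each event alone (0.125) would be left raw; the stream still escalates.
    print(run_stream([0.125] * 7))
```

The point of the toy is that every single event would be left raw if scored alone, but the policy still ends up at redact once the stream's accumulated risk is taken into account.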

The result on the bursty workload: Privacy@HighRisk 0.9907 and Utility@LowRisk 0.8466 at the same time. No static policy gets both. Always-Redact gets the privacy score but utility goes to zero. Always-Pseudo gets close on privacy but utility drops to 0.44. The re-identifier AUROC drops by 0.9167, identical across 10 runs (std 0.0).

What’s in this dataset

Three Hugging Face configs:

default – 34 live adaptive masking events, each with full risk breakdown, policy decision, consent status, and CRDT risk

signed – same events with ECDSA signatures and a Merkle chain, FHIR-exportable

crossmodal – 260 rows across 5 scenarios testing cross-modal PHI linkage. Scenario E is an adversarial attacker that stays below individual risk thresholds while accumulating links across modalities
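A rough sketch of why a Scenario E style attacker slips past per-event checks. The event fields (modality, risk, values), the 0.5 threshold, and the placeholder values are hypothetical and are not the dataset's actual schema. The idea is just to compare a per-event threshold check with a check that tracks which values recur across modalities.

```python
from collections import defaultdict

# Hypothetical per-event threshold, chosen only for this example.
PER_EVENT_THRESHOLD = 0.5


def audit(events):
    """Return (indices flagged by the per-event check,
    values that appear in more than one modality)."""
    modalities_by_value = defaultdict(set)
    flagged = []
    for index, event in enumerate(events):
        if event["risk"] >= PER_EVENT_THRESHOLD:
            flagged.append(index)
        for value in event["values"]:
            modalities_by_value[value].add(event["modality"])
    links = {
        value: sorted(modalities)
        for value, modalities in modalities_by_value.items()
        if len(modalities) > 1
    }
    return flagged, links


# Placeholder identifiers, not real or benchmark data.
stream = [
    {"modality": "text", "risk": 0.25, "values": ["NAME_A"]},
    {"modality": "asr", "risk": 0.25, "values": ["NAME_A"]},
    {"modality": "text", "risk": 0.25, "values": ["DATE_1"]},
    {"modality": "image", "risk": 0.25, "values": ["DATE_1"]},
]

if __name__ == "__main__":
    flagged, links = audit(stream)
    print("flagged per event:", flagged)
    print("cross-modal links:", links)
```

Nothing gets flagged event by event, yet two values end up linked across modalities, which is the pattern a stateful controller has to catch.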

Supplementary files:

leakage breakdown by entity type (MRN, date, name, facility)
full risk component trace per event (units factor, recency, link bonus)
threshold sensitivity sweep across 8 values and 3 workloads
baseline comparison: 6 policies across 3 workloads, every number needed to reproduce the Pareto frontier

What’s new in this version

Previously the dataset had placeholder charts and incomplete supplementary files. This version has everything regenerated from the actual run: risk trace, leakage breakdown, threshold sensitivity, and all four inline charts in the README are plotted from real event data.

Also added the adversarial crossmodal scenario, the FHIR audit trail, and the signed Merkle audit log for anyone working on compliance tooling.
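For anyone new to tamper-evident audit logs, here's a minimal hash-chain sketch using only the standard library. It leaves out the ECDSA signatures and the FHIR export, and the entry layout (event, prev_hash, hash), the genesis value, and the function names are assumptions for illustration, not the format used in the signed config.

```python
import hashlib
import json

# Hypothetical starting value for the chain.
GENESIS = "0" * 64


def _digest(prev_hash, event):
    """Hash the previous digest together with a canonical JSON payload."""
    payload = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def chain_events(events):
    """Build a log where each entry commits to the one before it."""
    prev = GENESIS
    log = []
    for event in events:
        digest = _digest(prev, event)
        log.append({"event": event, "prev_hash": prev, "hash": digest})
        prev = digest
    return log


def verify_chain(log):
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev or entry["hash"] != _digest(prev, entry["event"]):
            return False
        prev = entry["hash"]
    return True


if __name__ == "__main__":
    log = chain_events([{"id": 1, "policy": "weak"}, {"id": 2, "policy": "redact"}])
    print(verify_chain(log))       # chain is intact
    log[0]["event"]["policy"] = "raw"
    print(verify_chain(log))       # edit is detected
```

Changing any earlier event invalidates its hash and every hash after it, which is what makes the log usable as an audit trail.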

MIT licensed, no DUA, fully synthetic

i2b2 and PhysioNet both require data use agreements. That makes sense given they have real patient data. It also means you can’t just clone and run without going through an approval process.

This has no real patient data. You can load it right now:

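from datasets import load_dataset

# No config name loads the default config (the adaptive masking events)
ds = load_dataset("vkatg/streaming-phi-deidentification-benchmark")
# Pass a config name to load the cross-modal linkage scenarios
cm = load_dataset("vkatg/streaming-phi-deidentification-benchmark", "crossmodal")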

Full code is on GitHub at azithteja91/phi-exposure-guard (adaptive PHI de-identification for streaming multimodal data: exposure-aware, stateful, and audit-ready) if you want to run the controller yourself or extend it.

If you’re building a de-identification system, doing privacy research, working on clinical NLP, or just want a streaming benchmark you can actually use without paperwork, give it a try. Happy to answer questions here or in the repo.
