How do you handle sensitive data (PII) in datasets before training models?
Hugging Face Forums [Unofficial]
April 16, 2026
Hi everyone,
I’ve been working on a few ML pipelines recently (datasets + inference APIs), and I ran into something I don’t see discussed much: sensitive data leaking into datasets and logs.
In a couple of cases, I noticed things like:
* Emails and phone numbers inside training data
* User inputs getting stored in logs during inference
* Internal IDs or tokens accidentally exposed
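To make the first bullet concrete, here is a minimal sketch of a pre-training scan for email-like and phone-like strings. It is illustrative only: the regex patterns, the `find_pii` and `scan_records` helpers, and the `"text"` field name are all assumptions for this example, and simple patterns like these will miss many formats and flag some non-PII (long numeric IDs, for instance).

```python
import re

# Illustrative patterns only; real PII detection needs much broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def find_pii(text):
    """Return the email-like and phone-like substrings found in text."""
    return {
        "email": EMAIL_RE.findall(text),
        "phone": PHONE_RE.findall(text),
    }


def scan_records(records, field="text"):
    """Yield (index, hits) for each record whose field contains a match.

    `records` is assumed to be an iterable of dicts; `field` is a
    hypothetical column name.
    """
    for i, rec in enumerate(records):
        hits = find_pii(rec.get(field, ""))
        if hits["email"] or hits["phone"]:
            yield i, hits
```

Running `list(scan_records(rows))` over a sample of rows before training gives a rough sense of how much of this is in the data, which helps decide whether to drop, mask, or review the flagged records.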
Curious how others are handling this:
* Do you clean/anonymize datasets before training?
* Are you masking data in logs and API responses?
* Or is this something you only worry about for compliance later on?
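For the log-masking question, one possible approach is a filter on the standard-library `logging` handler that redacts messages before they are written. This is a sketch under assumptions, not a complete solution: the `mask` helper, the `RedactingFilter` class, and the placeholder tokens are hypothetical names, and the patterns are the same rough regexes a quick scan might use.

```python
import logging
import re

# Illustrative patterns only; they will produce false positives
# (e.g. long digit runs such as timestamps) and miss many PII formats.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def mask(text):
    """Replace email-like and phone-like substrings with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


class RedactingFilter(logging.Filter):
    """Hypothetical filter that masks the fully formatted log message."""

    def filter(self, record):
        # Format msg % args first so values passed as arguments are masked too.
        record.msg = mask(record.getMessage())
        record.args = None
        return True  # keep the record, just with a redacted message


# Attach to the handler so records propagated from child loggers are covered.
handler = logging.StreamHandler()
handler.addFilter(RedactingFilter())
logger = logging.getLogger("inference")
logger.addHandler(handler)
```

With this in place, a call like `logger.warning("bad input from %s", user_email)` would be emitted with the address replaced by `[EMAIL]`. It only covers what goes through that handler, so API responses and any other sinks would need their own masking step.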
What surprised me:
Even small datasets can contain a lot of hidden sensitive info, and once it’s used in:
* training
* embeddings
* logging systems
…it becomes much harder to clean up afterward.
Would love to hear:
* Your current approach
* Any tools or workflows you use
* Best practices you follow