
How do you handle sensitive data (PII) in datasets before training models?

Hugging Face Forums [Unofficial] April 16, 2026

Hi everyone,

I’ve been working on a few ML pipelines recently (datasets + inference APIs), and I ran into something I don’t see discussed much: sensitive data leaking into datasets and logs.

In a couple of cases, I noticed things like:

* Emails and phone numbers inside training data
* User inputs getting stored in logs during inference
* Internal IDs or tokens accidentally exposed

Curious how others are handling this:

* Do you clean/anonymize datasets before training?
* Are you masking data in logs and API responses?
* Or is this something you only worry about for compliance later on?

What surprised me: even small datasets can contain a lot of hidden sensitive info, and once it’s used in:

* training
* embeddings
* logging systems

…it becomes much harder to clean up afterward.

Would love to hear:

* Your current approach
* Any tools or workflows you use
* Best practices you follow
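
To make the masking questions above concrete, here is a minimal sketch of regex-based redaction for emails and phone numbers, plus a logging filter that applies it before a record is written. Everything in it is an assumption for illustration: the patterns, the `redact` function, the `[EMAIL]`/`[PHONE]` placeholders, and the `RedactingFilter` class are hypothetical names, not part of any library. The patterns are deliberately simple, so they will miss some formats and can flag non-PII digit runs such as dates.

```python
import logging
import re

# Deliberately simple patterns (assumptions for illustration): they catch
# common formats but will miss edge cases and can produce false positives.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"(?<!\w)\+?\d[\d\s().-]{7,}\d(?!\w)")


def redact(text: str) -> str:
    """Replace emails and phone-like digit runs with placeholder tags."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


class RedactingFilter(logging.Filter):
    """Hypothetical filter that masks PII in a record before it is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Render the final message first so PII passed via args is covered too.
        record.msg = redact(record.getMessage())
        record.args = None
        return True  # keep the record, now redacted


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.addFilter(RedactingFilter())
    logger = logging.getLogger("inference")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.info("user input: %s", "reach me at jane.doe@example.com")
```

The same `redact` function could be applied to each text field of a dataset before training, for example through `Dataset.map` in the Hugging Face `datasets` library. Regexes only cover structured identifiers; names, addresses, and internal tokens would likely need a dedicated PII detection tool or custom rules.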
