{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreig24bo4zn57zoaj6j3zrqi6p7qqenobhhanb7ybcmo6qk7oelljou",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mjmckwjw7hi2"
},
"path": "/t/how-do-you-handle-sensitive-data-pii-in-datasets-before-training-models/175312#post_1",
"publishedAt": "2026-04-16T11:27:51.000Z",
"site": "https://discuss.huggingface.co",
"textContent": "Hi everyone\n\nI’ve been working on a few ML pipelines recently (datasets + inference APIs), and I ran into something I don’t see discussed much:\n\nSensitive data leaking into datasets and logs\n\nIn a couple of cases, I noticed things like:\n\n * Emails and phone numbers inside training data\n\n * User inputs getting stored in logs during inference\n\n * Internal IDs or tokens accidentally exposed\n\n\n\n\nCurious how others are handling this:\n\n * Do you clean/anonymize datasets before training?\n\n * Are you masking data in logs and API responses?\n\n * Or is this something you only worry about for compliance later on?\n\n\n\n\nWhat surprised me:\n\nEven small datasets can contain a lot of hidden sensitive info, and once it’s used in:\n\n * training\n\n * embeddings\n\n * logging systems\n\n\n\n\n…it becomes much harder to clean up afterward.\n\nWould love to hear:\n\n * Your current approach\n\n * Any tools or workflows you use\n\n * Best practices you follow\n\n\n",
"title": "How do you handle sensitive data (PII) in datasets before training models?"
}