Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreig24bo4zn57zoaj6j3zrqi6p7qqenobhhanb7ybcmo6qk7oelljou",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mjmckwjw7hi2"
  },
  "path": "/t/how-do-you-handle-sensitive-data-pii-in-datasets-before-training-models/175312#post_1",
  "publishedAt": "2026-04-16T11:27:51.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "Hi everyone\n\nI’ve been working on a few ML pipelines recently (datasets + inference APIs), and I ran into something I don’t see discussed much:\n\nSensitive data leaking into datasets and logs\n\nIn a couple of cases, I noticed things like:\n\n  * Emails and phone numbers inside training data\n  * User inputs getting stored in logs during inference\n  * Internal IDs or tokens accidentally exposed\n\nCurious how others are handling this:\n\n  * Do you clean/anonymize datasets before training?\n  * Are you masking data in logs and API responses?\n  * Or is this something you only worry about for compliance later on?\n\nWhat surprised me:\n\nEven small datasets can contain a lot of hidden sensitive info, and once it’s used in:\n\n  * training\n  * embeddings\n  * logging systems\n\n…it becomes much harder to clean up afterward.\n\nWould love to hear:\n\n  * Your current approach\n  * Any tools or workflows you use\n  * Best practices you follow\n",
  "title": "How do you handle sensitive data (PII) in datasets before training models?"
}