Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreict5s2utovmqoyym7cewq65q4vagwgz52pdnnba5h4kb4thghy3km",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mk7mrnfwofg2"
  },
  "path": "/t/the-unglamorous-bug-in-every-fine-tuning-tutorial-nobody-cleans-the-data/175507#post_1",
  "publishedAt": "2026-04-24T03:35:23.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "modelbrew/optimizer-noise-benchmark",
    "app.modelbrew.ai"
  ],
  "textContent": "Fine-tuning posts on this forum mostly focus on models, LoRA configs, and trainers. The input layer — the dataset — gets far less attention, even though most failed fine-tunes we see in practice come down to dirty data.\n\nAt ModelBrew we spend most of our time on this problem. So we ran a small reproducible experiment to put numbers on it.\n\n## Setup\n\nWe took four well-known instruction-tuning datasets from the Hub (searchable by name): `medalpaca/medical_meadow_medqa`, `b-mc2/sql-create-context`, `openai/gsm8k`, and `gbharti/finance-alpaca`.\n\nFor each, we converted rows to `{instruction, output}` JSONL, trimmed to roughly 2 MB, and injected a known, fixed amount of noise on top of the clean baseline:\n\nNoise category | Injected fraction\n---|---\nExact duplicate rows | ~1.0%\nEmpty `output` | ~1.0%\nOne-word outputs (`\"ok\"`, `\".\"`, `\"yes\"`) | ~1.0%\nHTML-wrapped outputs (`<p>…</p>`, `&`) | ~0.5%\nLeading/trailing whitespace + zero-width Unicode | ~0.5%\n\nGround-truth counts per file are known down to the row. We then scanned each poisoned file with the ModelBrew Dataset Optimizer.\n\n## What we injected, per file\n\nfile | rows | duplicate | empty | too-short | html | whitespace\n---|---|---|---|---|---|---\nfinance-alpaca | 1848 | 18 | 18 | 18 | 9 | 9\nmedical_meadow_medqa | 2033 | 20 | 20 | 20 | 10 | 10\ngsm8k | 3807 | 37 | 37 | 37 | 18 | 18\nsql-create-context | 7238 | 71 | 71 | 71 | 35 | 35\n\nThe Optimizer caught the injected noise across all four files. One surprise worth flagging: even _before_ we added any noise, the raw `finance-alpaca` source had real PII rows and “As an AI language model…” slop leaking through. The scanner caught those too. Public HF datasets are not as clean as people assume.\n\n## Why this matters for anyone fine-tuning here\n\nCommon data pathologies we see in customer datasets that don’t fail CI but quietly degrade the trained model:\n\n  * 30–50% near-duplicates from upstream scraping → wasted compute and implicit upweighting of a few examples.\n  * PII leaking through from customer support logs → memorization and extractable-at-inference risk.\n  * HTML/markdown from web scrapes → models learn to emit markup.\n  * A few percent of rows containing “As an AI language model…” slop → unintended persona injection.\n  * Empty or one-word completions → the model learns to output nothing.\n\nNone of this is visible from eyeballing the first 20 rows.\n\n## Artifacts and where to try it\n\nWe’ve published the 4 poisoned test files plus a JSON manifest of the exact injection counts as a Hub dataset so others can benchmark their own data-quality tools against the same ground truth: modelbrew/optimizer-noise-benchmark.\n\nThe scanner itself — score, per-issue breakdown, one-click autofix, and export — is a free tool at app.modelbrew.ai. No signup needed to scan a file.\n\nHappy to discuss methodology, add more datasets to the benchmark, or hear what noise categories we should inject next. If you’ve shipped a fine-tune that went sideways because of data, what was the root cause?",
  "title": "The unglamorous bug in every fine-tuning tutorial: nobody cleans the data"
}