{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreid5xqw24opbmg6cfrb2ovckyccju36cgfu3swlimqxqlyc5v2jd5q",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mhjuxxcer7a2"
  },
  "path": "/t/how-do-i-prepare-datasets-for-training-nlp-models/174420#post_2",
  "publishedAt": "2026-03-20T14:45:01.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Hugging Face",
    "The NIST technical series.",
    "arXiv"
  ],
  "textContent": "In any case, unless you first decide for yourself on the “task you want to accomplish,” the “general type of model you’re training (LLM? Embeddings?),” the “quality required for the task,” and the “training method you’ll use for fine-tuning,” it may be difficult to actually create a useful dataset.\n\nOnce these details are settled, I think that in most cases, you just need to focus on the dataset’s format, quality, and volume, and you’ll be fine. It is generally said that the quality of the dataset is the most critical factor for the quality of the resulting model. (Of course, if the format is incorrect, it can ruin everything from a programming standpoint, but that goes without saying, so I’ll leave that aside.)\n\n* * *\n\nPrepare NLP training data as a **pipeline** , not as a one-time cleanup task. In the Hugging Face ecosystem, the dataset you train on usually ends up as one of a few shapes: plain text for language modeling, `text + label` for classification, token-level labels for NER and similar tasks, or `messages` / `prompt`-`completion` records for chat-style fine-tuning. That choice comes first, because the model learns from the final token sequence, not from your original raw files. (Hugging Face)\n\n## The big picture\n\nA good dataset is not just “a lot of text.” It is text that has been collected for a specific purpose, cleaned without destroying useful signal, split correctly, formatted the way the model expects, and documented so you can reproduce and audit it later. Hugging Face’s dataset card docs and the Datasheets for Datasets paper both treat documentation, provenance, intended use, and limitations as part of the dataset itself, not optional decoration. (Hugging Face)\n\nFor modern language models, especially chat models, formatting matters more than many beginners expect. Chat models are still causal language models under the hood. They just continue token sequences that encode roles like user and assistant. 
Different models use different control tokens and chat formats, so a dataset that looks “right” in JSON can still be wrong for training if the final templated token sequence is wrong. (Hugging Face)\n\n* * *\n\n## Step 1. Decide exactly what you are training\n\nBefore collecting anything, define the task in one sentence.\n\nExamples:\n\n  * “Classify support tickets into billing, technical, and account.”\n  * “Detect person, organization, and location entities in legal text.”\n  * “Teach an instruct model to answer in my domain style.”\n  * “Continue pretraining on biomedical text.”\n\n\n\nThat matters because each task wants a different dataset shape. Hugging Face’s task docs distinguish text classification, token classification, and causal language modeling as different tasks with different labels and preprocessing needs, and TRL separately defines language-modeling, prompt-only, prompt-completion, and preference datasets. (Hugging Face)\n\nA simple rule:\n\n  * **Classification**: one row = one text + one label.\n  * **NER / token classification**: one row = one text or sentence + one label per token.\n  * **Language modeling / continued pretraining**: one row = raw text.\n  * **Chat SFT**: one row = one conversation or one prompt-response example.\n  * **Preference tuning**: one row = prompt + chosen answer + rejected answer. (Hugging Face)\n\n\n\nIf you skip this step, everything after it gets muddled: you start collecting text without knowing what a single training example is supposed to mean.\n\n* * *\n\n## Step 2. 
Define what one row should look like\n\nOnce the task is clear, define the schema.\n\n### Common schemas\n\n**Text classification**\n\n\n    {\"id\": \"001\", \"text\": \"My order still hasn't arrived.\", \"label\": \"shipping_delay\"}\n\n\n**Token classification / NER**\n\n\n    {\n      \"id\": \"002\",\n      \"tokens\": [\"John\", \"works\", \"at\", \"OpenAI\"],\n      \"ner_tags\": [\"B-PER\", \"O\", \"O\", \"B-ORG\"]\n    }\n\n\n**Plain text for language modeling**\n\n\n    {\"id\": \"003\", \"text\": \"Large language models predict the next token in a sequence.\"}\n\n\n**Chat fine-tuning**\n\n\n    {\n      \"id\": \"004\",\n      \"messages\": [\n        {\"role\": \"user\", \"content\": \"Summarize this article in 3 bullets.\"},\n        {\"role\": \"assistant\", \"content\": \"• ...\\n• ...\\n• ...\"}\n      ]\n    }\n\n\n**Prompt-completion**\n\n\n    {\n      \"id\": \"005\",\n      \"prompt\": \"Summarize this article in 3 bullets.\",\n      \"completion\": \"• ...\\n• ...\\n• ...\"\n    }\n\n\nThese shapes match what Hugging Face TRL documents as supported dataset types and formats. Conversational datasets use message lists with `role` and `content`, while standard prompt-completion and language-modeling formats use top-level text fields. (Hugging Face)\n\nIn practice, add metadata columns early. Good defaults are:\n\n  * `id`\n  * `source`\n  * `language`\n  * `timestamp`\n  * `license`\n  * `split`\n  * `quality_flag`\n\n\n\nThese are not required by the trainer, but they make debugging, filtering, and documentation much easier. Dataset cards on the Hub explicitly support metadata like license, language, and tags for discoverability and responsible use. (Hugging Face)\n\n* * *\n\n## Step 3. Collect data with provenance, rights, and realism in mind\n\nNow collect the text. The main goal is not maximum volume. The main goal is **match**.\n\nYour training data should resemble the data the model will see later. 
If the real task is messy customer messages, a dataset of polished essays is a poor fit. If the real task is short extraction answers, long free-form explanations may teach the wrong behavior. That is a general consequence of how the task formats in Transformers and TRL are defined: the model learns the sequence pattern you provide. (Hugging Face)\n\nWhile collecting, keep track of:\n\n  * where each sample came from,\n  * when it was collected,\n  * whether you are allowed to use it,\n  * what language it is in,\n  * and whether it contains sensitive material.\n\n\n\nThis is exactly the kind of information dataset cards and datasheets are meant to capture. (Hugging Face)\n\nA good mental model is to store three layers, not one:\n\n  * **raw**: untouched source data\n  * **clean**: normalized and filtered text\n  * **train-ready**: task-specific rows\n\n\n\nThat separation is not a Hugging Face requirement. It is a practical workflow decision. It makes mistakes reversible, because the later layers can always be rebuilt from the untouched raw data.\n\n* * *\n\n## Step 4. Choose storage formats that fit Hugging Face well\n\nIn the Hugging Face ecosystem, the easiest and safest dataset formats are file-backed formats that `load_dataset()` already understands. The official loading docs say it can load local and remote `csv`, `json`, `txt`, and `parquet` files. (Hugging Face)\n\nA practical default:\n\n  * **JSONL** for chat, prompt-completion, and nested examples\n  * **Parquet** for larger datasets and efficient storage\n  * **CSV** only for simple flat tables\n  * **TXT** for raw text corpora when each file or line maps naturally to plain text examples (Hugging Face)\n\n\n\nFor most NLP work in Hugging Face:\n\n  * start small with **JSONL** if you are still designing the schema,\n  * move to **Parquet** when the dataset gets large enough that storage, sharding, and speed matter more.\n\n\n\n* * *\n\n## Step 5. 
Clean the text conservatively\n\nCleaning is necessary, but over-cleaning is common.\n\nThe purpose of cleaning is to remove noise that does not help the model learn the task. It is not to make every row look pretty.\n\nTypical text cleaning steps:\n\n  * remove empty rows\n  * fix broken encodings\n  * normalize line endings\n  * normalize repeated whitespace\n  * strip obvious boilerplate if it is irrelevant\n  * remove exact duplicates\n  * filter out rows that are too short or clearly malformed\n  * standardize labels and metadata fields\n\n\n\nHugging Face Datasets is designed around this style of iterative processing with `map()` and `filter()`. The official process docs highlight `train_test_split()`, sharding, `map()`, `filter()`, and even on-the-fly transforms with `with_transform()`. The text-processing guide specifically recommends tokenizing with `map()`. (Hugging Face)\n\nA good test for whether you are cleaning too hard is this:\n\n> Does this change remove noise, or does it remove information the model should learn?\n\nFor example:\n\n  * Lowercasing may be fine for some classification tasks.\n  * Lowercasing can be harmful for NER, because capitalization is often a useful signal.\n  * Removing line breaks may be fine for sentiment classification.\n  * Removing line breaks can harm chat, code, legal text, and structured extraction tasks.\n\n\n\nSo clean for the task, not for aesthetics.\n\n* * *\n\n## Step 6. Remove or protect sensitive data early\n\nPrivacy is not a final polishing step. It belongs near ingestion.\n\nNIST’s Generative AI Profile states that training can involve large volumes of data and that such data may include personal data, creating privacy risks. 
(The NIST technical series.)\n\nThat means you should decide early:\n\n  * Is personal data allowed in training at all?\n  * Should names, emails, phone numbers, or IDs be removed?\n  * Do you need redaction, hashing, or pseudonymization?\n  * Should private and non-private variants of the dataset be stored separately?\n\n\n\nFor many practical NLP projects, removing or masking sensitive identifiers is the safe default unless there is a strong reason not to.\n\n* * *\n\n## Step 7. Deduplicate before you trust your dataset\n\nDuplicates waste compute and can also fake evaluation gains.\n\nHugging Face’s BigCode deduplication write-up makes two important points: deduplication improves training efficiency, and duplicates can create data leakage and benchmark contamination, making apparent improvements unreliable. (Hugging Face)\n\nCheck for:\n\n  * exact duplicate rows\n  * near duplicates\n  * duplicated documents under different filenames\n  * repeated conversations\n  * train/eval overlap after normalization\n\n\n\nThis matters even more than people think. A model that “improves” because the same examples appear in both train and test has not really improved.\n\n* * *\n\n## Step 8. Label the data in a way the model can actually learn from\n\nIf your task is supervised, labels must match the task structure.\n\nFor classification, one sample gets one label. For token classification, each token gets a label. Hugging Face’s task docs define these clearly: text classification assigns a label or class to a sequence, and token classification assigns a label to individual tokens. (Hugging Face)\n\nThat leads to two practical rules:\n\n  1. **Write annotation guidelines before labeling at scale.**\n  2. **Audit label consistency, not just label count.**\n\n\n\nIf two annotators would label the same example differently most of the time, the model is being asked to learn a contradiction.\n\nFor chat or instruction tuning, the “label” is usually the desired assistant response. 
In that case, good examples are not just correct. They are also consistent in tone, formatting, and task behavior.\n\n* * *\n\n## Step 9. Split the dataset before you start serious experimentation\n\nYou need at least:\n\n  * **train**\n  * **validation**\n  * **test**\n\n\n\nHugging Face Datasets provides `train_test_split()` for creating splits when they do not already exist. (Hugging Face)\n\nBut the important part is not just the function. It is the logic.\n\nA good split prevents leakage. Examples that are closely related should stay in the same split. For example:\n\n  * all chunks from one document,\n  * all turns from one conversation,\n  * all messages from one user session.\n\n\n\nOtherwise the evaluation becomes too easy.\n\nA practical baseline:\n\n  * train: 80–90%\n  * validation: 5–10%\n  * test: 5–10%\n\n\n\nUse larger test sets when evaluation stability matters more than squeezing out every last training example.\n\n* * *\n\n## Step 10. In Hugging Face, preprocess with `map()`, `filter()`, and tokenization\n\nThis is the heart of the workflow.\n\nHugging Face Datasets is built for transformation pipelines. The general process guide emphasizes `map()`, `filter()`, splits, sharding, and on-the-fly transforms. The text-processing guide specifically recommends tokenizing datasets with `map()`. 
(Hugging Face)\n\nTypical pattern:\n\n\n    from datasets import load_dataset\n\n    # Load a local JSONL file as a single split.\n    ds = load_dataset(\"json\", data_files=\"train.jsonl\", split=\"train\")\n\n    def clean(example):\n        # Trim the ends and collapse whitespace runs to single spaces.\n        # Note: this also flattens line breaks, so skip it for chat, code, or legal text.\n        text = example[\"text\"].strip()\n        return {\"text\": \" \".join(text.split())}\n\n    ds = ds.map(clean)\n    # Keep only rows longer than 20 characters after cleaning.\n    ds = ds.filter(lambda x: len(x[\"text\"]) > 20)\n\n\nThen tokenize:\n\n\n    from transformers import AutoTokenizer\n\n    tokenizer = AutoTokenizer.from_pretrained(\"distilbert-base-uncased\")\n\n    def tokenize(batch):\n        # With batched=True, batch[\"text\"] is a list of strings.\n        return tokenizer(batch[\"text\"], truncation=True)\n\n    # Tokenize in batches, which is faster than row by row.\n    tokenized = ds.map(tokenize, batched=True)\n\n\nThat pattern is directly aligned with the Hugging Face docs. (Hugging Face)\n\n* * *\n\n## Step 11. Use the right dataset object for the dataset size\n\nHugging Face has two important dataset modes:\n\n  * **Dataset**: map-style, random access, easier for normal-sized datasets\n  * **IterableDataset**: lazy, streamed, better for very large datasets\n\n\n\nThe official comparison page says `IterableDataset` is ideal for very large datasets because of its lazy behavior and speed advantages, while regular `Dataset` is great for everything else. It also explains that streaming loads data progressively without writing everything to disk. (Hugging Face)\n\nSo the rule is:\n\n  * If the dataset fits comfortably on disk and in your workflow, use `Dataset`.\n  * If it is huge, remote, or expensive to materialize, use `streaming=True` and work with an `IterableDataset`. (Hugging Face)\n\n\n\nFor streaming, remember one subtle point: shuffle is approximate. Hugging Face’s streaming docs explain that `IterableDataset.shuffle()` uses a buffer, so the randomness depends on `buffer_size`. (Hugging Face)\n\n* * *\n\n## Step 12. Tokenization is not just a technical step. It changes what the model sees\n\nBeginners often think tokenization is a final detail. It is not. 
It is the conversion from human-readable text into the actual training sequence.\n\nHugging Face’s text-processing docs recommend tokenizing with `map()`. Transformers’ language-modeling docs explain that causal language models predict the next token in a sequence and cannot see future tokens. (Hugging Face)\n\nThat means you should decide:\n\n  * maximum length,\n  * truncation policy,\n  * whether long documents should be chunked,\n  * and whether document boundaries should be preserved.\n\n\n\nFor classification, truncation may be acceptable if the label depends mostly on the beginning of the text. For legal, technical, or retrieval-style tasks, careless truncation can silently throw away the answer-bearing part of the sample.\n\n* * *\n\n## Step 13. If you are training a chat or instruct model, format becomes critical\n\nThis is the part most often done wrong.\n\nTransformers’ chat template docs explain that chat models still continue token sequences, but the `messages` list gets converted into a model-specific token sequence with control tokens. Different chat models can use different formats even when they come from the same base family. (Hugging Face)\n\nIn Hugging Face TRL:\n\n  * SFT supports language-modeling and prompt-completion datasets.\n  * `SFTTrainer` works with standard and conversational formats.\n  * If you pass a conversational dataset, the trainer automatically applies the model’s chat template. 
(Hugging Face)\n\n\n\nThat means your chat dataset should usually look like this:\n\n\n    {\n      \"messages\": [\n        {\"role\": \"system\", \"content\": \"You are a careful assistant.\"},\n        {\"role\": \"user\", \"content\": \"Explain overfitting in plain English.\"},\n        {\"role\": \"assistant\", \"content\": \"Overfitting means the model memorized patterns that do not generalize well...\"}\n      ]\n    }\n\n\nOr like this if you want prompt-completion:\n\n\n    {\n      \"prompt\": [{\"role\": \"user\", \"content\": \"Explain overfitting in plain English.\"}],\n      \"completion\": [{\"role\": \"assistant\", \"content\": \"Overfitting means...\"}]\n    }\n\n\nThose shapes are explicitly documented in TRL’s dataset-format guide. (Hugging Face)\n\n### A very important advanced caution\n\nIf you use `assistant_only_loss=True`, TRL says this only works with chat templates that support returning the assistant-token mask via `{% generation %}` blocks. If the template does not support that, your training setup may not behave as you expect. Similarly, prompt-completion datasets default to loss on completion tokens only unless you change that behavior. (Hugging Face)\n\nSo for chat training, always inspect:\n\n  * one raw example,\n  * one templated example,\n  * and the final tokenized sequence length.\n\n\n\n* * *\n\n## Step 14. Validate the dataset before training\n\nDo not start training just because loading works.\n\nBefore training, inspect:\n\n  * row counts by split,\n  * label distribution,\n  * language distribution,\n  * average and max sequence lengths,\n  * duplicate counts,\n  * malformed rows,\n  * several random examples from each split,\n  * several difficult edge-case examples.\n\n\n\nThis is partly common sense and partly a consequence of the contamination and formatting issues described earlier. If you do not manually inspect the data, you can run a long training job on rows that are technically valid but semantically wrong. 
Deduplication guidance and task formatting docs both point toward this kind of inspection. (Hugging Face)\n\nA useful sanity test is this:\n\n> Can you look at 20 random rows and explain why each one deserves to be in the dataset?\n\nIf not, the curation rules are probably too weak.\n\n* * *\n\n## Step 15. Document the dataset like a real artifact\n\nOnce the data is ready, create a dataset card.\n\nHugging Face’s dataset card docs say the `README.md` in the dataset repo is the dataset card, and that it should help users understand the contents of the dataset, how it should be used, and what biases or risks may exist. Metadata like license, language, size, and tags also live there. (Hugging Face)\n\nA practical dataset card should answer:\n\n  * What is this dataset for?\n  * Where did it come from?\n  * What preprocessing was applied?\n  * What labels or schemas does it use?\n  * What languages are included?\n  * What license applies?\n  * What sensitive-data handling was done?\n  * What are the known limitations?\n\n\n\nThat is also the spirit of Datasheets for Datasets. (arXiv)\n\n* * *\n\n## A practical Hugging Face workflow\n\nHere is a clean end-to-end pattern.\n\n### 1. Store raw data\n\nKeep source files untouched.\n\n### 2. Convert to a simple intermediate format\n\nUse JSONL or Parquet. Hugging Face loads both directly. (Hugging Face)\n\n### 3. Load with `load_dataset()`\n\n\n    from datasets import load_dataset\n    ds = load_dataset(\"json\", data_files={\"train\": \"train.jsonl\", \"validation\": \"valid.jsonl\"})\n\n\nThis matches the official loading API for file-backed datasets. (Hugging Face)\n\n### 4. Clean with `map()` and `filter()`\n\nNormalize text, drop junk, standardize labels. (Hugging Face)\n\n### 5. Split correctly\n\nUse preplanned train/validation/test splits or `train_test_split()` where appropriate. (Hugging Face)\n\n### 6. Tokenize\n\nUse `map()` for tokenization. (Hugging Face)\n\n### 7. 
For chat models, verify template behavior\n\nDo not assume the raw `messages` list is the final training format. (Hugging Face)\n\n### 8. Train\n\nUse the trainer appropriate to the task: `Trainer` for standard tasks, `SFTTrainer` for modern supervised chat fine-tuning. (Hugging Face)\n\n### 9. Publish with a dataset card\n\nInclude metadata and limitations. (Hugging Face)\n\n* * *\n\n## Common mistakes\n\n### Collecting first, defining the task later\n\nThis usually produces a pile of text with no stable row definition. The trainer can only learn from well-defined example shapes. (Hugging Face)\n\n### Over-cleaning\n\nRemoving casing, punctuation, or line structure can destroy information needed for the task.\n\n### Ignoring duplicates\n\nDuplicates hurt both training efficiency and honest evaluation. (Hugging Face)\n\n### Bad splits\n\nChunks from the same source ending up in train and test can make evaluation look much better than it really is.\n\n### Using the wrong schema for the trainer\n\nA row shaped for classification is not a row shaped for chat SFT. TRL and Transformers treat them differently. (Hugging Face)\n\n### Not checking the final chat template\n\nFor chat fine-tuning, this is one of the highest-risk mistakes. Different models expect different control tokens and templates. (Hugging Face)\n\n### No dataset documentation\n\nWithout documentation, future-you will not remember what was filtered, masked, merged, or relabeled. (Hugging Face)\n\n* * *\n\n## The shortest useful checklist\n\nUse this before training:\n\n  1. Define the task.\n  2. Define one row schema.\n  3. Collect only relevant, permitted data.\n  4. Keep raw and cleaned layers separate.\n  5. Clean lightly and intentionally.\n  6. Remove or mask sensitive data.\n  7. Deduplicate.\n  8. Split without leakage.\n  9. Store in JSONL or Parquet.\n  10. Load with `load_dataset()`.\n  11. Preprocess with `map()` and `filter()`.\n  12. Tokenize.\n  13. 
For chat models, verify the chat template and loss boundaries.\n  14. Inspect random samples manually.\n  15. Write a dataset card. (Hugging Face)\n\n",
  "title": "How do I prepare datasets for training NLP models?"
}