Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreih4zmxaumu3u5fq6jvbhncfv666crysk3pl4zaudjrz2gtngrmyqi",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mfltwp7fzey2"
  },
  "path": "/t/llm-course-training-a-causal-language-model-from-scratch/173736#post_2",
  "publishedAt": "2026-02-24T08:26:22.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Hugging Face",
    "Hugging Face Forums",
    "colab.research.google.com",
    "GitHub"
  ],
  "textContent": "Oh… Seems normal.\n\n* * *\n\n## 1) Why this takes so long in Colab, and why the GPU is idle\n\n### What that `map(tokenize, ...)` cell is doing\n\nIn the course section you linked, the dataset is large and this preprocessing step **creates a much larger dataset than you started with**.\n\n  * You start with **~606,720 training files**. (Hugging Face)\n  * With `context_length=128` and `return_overflowing_tokens=True`, each long file is split into many 128-token chunks. The course shows this becomes **~16.7 million** training sequences after preprocessing. (Hugging Face)\n\n\n\nSo the time isn’t just “tokenize 600k strings”; it’s also “**write tens of millions of rows** to Arrow cache files”, which is often dominated by **CPU + disk I/O** on Colab.\n\n### Why you get “not using GPU”\n\nTokenization in the Hugging Face stack is **CPU-side** (including the “Fast” Rust tokenizers). HF maintainers explicitly state tokenization does not run on GPU. (Hugging Face Forums)\n\n`datasets.Dataset.map()` only “uses GPU” if your mapped function itself performs CUDA work (e.g., model inference inside `map()`), and if you do that with multiprocessing you must use the `spawn` start method to avoid CUDA fork errors. (Hugging Face)\nThat’s not your situation: your `tokenize()` is pure CPU/string work.\n\n### Is “> 1 hour” reasonable?\n\nYes—given the expansion to ~16.7M examples and heavy cache writing, “hour-scale” preprocessing is plausible on Colab. (Hugging Face)\nIt’s also common to see **speed drop near the end** as writing/merging becomes the bottleneck. (Hugging Face Forums)\n\n* * *\n\n## 1A) What actually speeds up `Dataset.map()` in your case (CPU + I/O tuning)\n\nThese are the knobs that matter for large tokenization jobs:\n\n### Use batched mapping and tune `batch_size`\n\nDatasets defaults to **batch_size=1000** when `batched=True`, and you can adjust it. 
(Hugging Face)\nLarger batches often improve throughput until you hit RAM limits.\n\n### Use CPU multiprocessing: `num_proc`\n\nTokenization is CPU-bound, so `num_proc=2` or `4` can help on Colab (depending on available cores). The Datasets processing guide covers batched mapping and processing functions. (Hugging Face)\n\n### Reduce cache write overhead: `writer_batch_size`\n\n`writer_batch_size` controls how many rows are written per operation. The docs state the default is 1000 and explain the speed/memory tradeoff. (Hugging Face)\nHF staff also point to `writer_batch_size` as the parameter to reduce frequent flushing when mapping large datasets. (Hugging Face Forums)\n\n### Practical Colab configuration (good starting point)\n\n\n    import os\n    os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"  # avoids common fork/parallelism warnings\n\n    tokenized_datasets = raw_datasets.map(\n        tokenize,\n        batched=True,\n        batch_size=2000,        # try 1000, 2000, 5000\n        num_proc=2,             # try 2 (then 4 if you have cores)\n        writer_batch_size=5000, # try 2000–20000\n        remove_columns=raw_datasets[\"train\"].column_names,\n        desc=\"Tokenizing\",\n    )\n\n\n### Workflow tip that saves the most time\n\nUse a **subset** to debug the pipeline end-to-end before you run the full preprocessing. The official course notebook is meant to be runnable in Colab, but full preprocessing can be expensive. (colab.research.google.com)\n\n* * *\n\n## 2) Packing before chunking: your approach is conceptually correct (here’s the efficient version)\n\n### Background: why packing helps\n\nYour original function discards the remainder of each document after chunking to `context_length`. Packing reduces this waste by:\n\n  1. inserting EOS between documents,\n  2. concatenating tokens into a stream,\n  3. 
chunking the stream into fixed-size blocks.\n\n\n\nThis is the same overall strategy as the canonical CLM preprocessing pattern used in HF’s `run_clm.py`, which concatenates then chunks (`group_texts`). (GitHub)\n\n### The main performance pitfall\n\nDon’t build one gigantic concatenated list for the entire dataset in memory. Pack **within each batch** (`batched=True`), which keeps memory bounded and is what Datasets is optimized for. (Hugging Face)\n\nAlso avoid repeated `+` concatenations in a loop (can become quadratic). Prefer `extend()` or `itertools.chain`.\n\n### A fast “pack then chunk” `tokenize()` (minimal changes)\n\nThis replaces `return_overflowing_tokens` with explicit packing:\n\n\n    from itertools import chain\n\n    def tokenize(element):\n        outputs = tokenizer(\n            element[\"content\"],\n            truncation=False,\n            add_special_tokens=False,\n        )\n\n        eos = tokenizer.eos_token_id\n        # Add EOS after each document, then flatten efficiently\n        stream = list(chain.from_iterable(ids + [eos] for ids in outputs[\"input_ids\"]))\n\n        # Chunk\n        total_len = (len(stream) // context_length) * context_length\n        if total_len == 0:\n            return {\"input_ids\": []}\n\n        input_ids = [stream[i:i+context_length] for i in range(0, total_len, context_length)]\n        return {\"input_ids\": input_ids}\n\n\nAnd map it (with the same speed knobs as above):\n\n\n    tokenized_datasets = raw_datasets.map(\n        tokenize,\n        batched=True,\n        batch_size=2000,\n        num_proc=2,\n        writer_batch_size=5000,\n        remove_columns=raw_datasets[\"train\"].column_names,\n    )\n\n\n### One subtle limitation (and how to mitigate it)\n\nPacking inside `map(batched=True)` packs **within each batch** , so any leftover tokens at the end of a batch are dropped. Increase `batch_size` to reduce this boundary waste. 
(Hugging Face)\n\n### “Is it okay that tokens can attend across document boundaries?”\n\nWith standard CLM pretraining, this is commonly accepted; EOS is the boundary signal. If you wanted strict isolation, you’d need special attention masking or a different packing strategy, which is more complex than what this exercise targets.\n\n* * *\n\n## Quick checklist for your Colab run\n\n  * Accept that **GPU is not used during tokenization**; GPU matters during training. (Hugging Face Forums)\n  * Expect preprocessing to take a long time because it expands ~606,720 files into **~16.7 million chunks**. (Hugging Face)\n  * Speed it up with `batch_size`, `num_proc`, `writer_batch_size`. (Hugging Face)\n  * Implement packing by tokenizing with `truncation=False`, inserting EOS, concatenating per batch, then chunking (mirrors `run_clm.py`’s concatenate→chunk logic). (GitHub)\n\n\n\nIf you apply the packing function and the three map knobs (`batch_size`, `num_proc`, `writer_batch_size`), you typically get (a) less wasted data, and (b) more predictable preprocessing time on Colab, without trying to force GPU use where it doesn’t apply.",
  "title": "LLM course - training a causal language model from scratch"
}