Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreicv7gg7n7z3bl2gsmfe72pvk3icktj4pj3mmbweikkcfuhrlmnpli",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mgzpzl33rox2"
  },
  "path": "/t/missing-info-in-llm-course/174258#post_3",
  "publishedAt": "2026-03-14T06:09:26.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Hugging Face",
    "jalammar.github.io",
    "nlp.seas.harvard.edu",
    "Stanford CS336",
    "Hugging Face Forums",
    "GitHub",
    "arXiv"
  ],
  "textContent": "If I put together a rough collection using online resources, something like this:\n\n* * *\n\nThe Hugging Face LLM Course does include tokenizer training and a **scaled-down** causal-LM-from-scratch example. The deeper large-scale training material sits elsewhere, especially in HF’s training handbook/playbook, while **CS336** is the closest thing to a full “language modeling from scratch” course. (Hugging Face)\n\n## Track 1 — Understand what you are training\n\n**Use this track when:**\nTransformers still feel blurry. You can run code, but you do not yet have a clean mental model of tokenization, next-token prediction, attention, and causal LM training.\n\n**Goal:**\nUnderstand the pipeline end to end before caring about scale.\n\n**Read in this order:**\n\n  1. **The Illustrated Transformer**\nBest first stop. It is a visual introduction meant to simplify the Transformer concepts step by step. (jalammar.github.io)\n\n  2. **The Annotated Transformer**\nBest second stop. It is the implementation-oriented bridge from the paper to working code. (nlp.seas.harvard.edu)\n\n  3. **CS336: Language Modeling from Scratch**\nBest full course for the whole stack: tokenization, data, model construction, training, and evaluation. (Stanford CS336)\n\n  4. **HF Chapter 6**\nUse this to learn tokenization properly. HF explicitly states that training a tokenizer is **not** the same as training a model, and Chapter 6 is about training a brand new tokenizer so it can later be used to pretrain a language model. (Hugging Face)\n\n  5. **HF Chapter 7.6**\nThen do one small from-scratch causal LM run in the HF style. HF describes this section as training a completely new model from scratch on a Python code corpus with `Trainer` and Accelerate. 
(Hugging Face)\n\n\n\n\n**You are done with Track 1 when:**\nYou can explain why causal LM predicts the next token, why tokenization matters, why `labels` can look identical to `input_ids` in HF’s setup, and why packing text before chunking improves efficiency. The HF forums repeatedly show that these are the exact points where learners get confused. (Hugging Face Forums)\n\n**Main pitfalls:**\n\n  * Confusing **tokenizer training** with **model training**. HF explicitly separates them. (Hugging Face)\n  * Thinking the course example is “real full-scale pretraining.” It is not. HF presents it as a reduced, practical example. (Hugging Face)\n  * Getting stuck on label shifting. In the common HF causal-LM path, the shift happens inside the model. (Hugging Face Forums)\n\n\n\n* * *\n\n## Track 2 — Build a small LLM yourself\n\n**Use this track when:**\nYou want runnable code and a model you actually trained, even if it is small.\n\n**Goal:**\nTrain a tiny but real decoder-only model end to end.\n\n**Build in this order:**\n\n  1. **`minbpe`**\nStart here for tokenization code. It is a minimal, clean implementation of byte-level BPE, which is commonly used in LLM tokenization. (GitHub)\n\n  2. **`build-nanogpt`**\nThen build a GPT-like model step by step. The repo is intentionally structured so you can follow the commit history as the system is built gradually. (GitHub)\n\n  3. **Raschka’s `LLMs-from-scratch`**\nUse this when you want a more complete, careful, beginner-friendly path. The repo explicitly covers developing, pretraining, and finetuning a GPT-like LLM. (GitHub)\n\n  4. **HF blog: “How to train a new language model from scratch”**\nThis is the compact HF version of the whole pipeline: find data, train a tokenizer, train a language model, validate it, then fine-tune it. (Hugging Face)\n\n  5. **HF blog: “Training CodeParrot from Scratch”**\nUse this when you want a more realistic HF pretraining case than the course toy example. 
HF describes it as a step-by-step guide to training a large GPT-2 model for code from scratch. (Hugging Face)\n\n  6. **LitGPT pretraining tutorial**\nGood modern practical stack after the educational repos. LitGPT documents `litgpt pretrain`, and the project positions itself as high-performance LLM recipes for pretraining, finetuning, and deployment. (GitHub)\n\n  7. **TinyStories**\nBest cheap sandbox. The paper shows that very small models can learn coherent English on this dataset, which makes it unusually good for low-budget experiments. (arXiv)\n\n\n\n\n**A good endpoint for Track 2:**\nTrain a small model on TinyStories or a narrow code corpus, then inspect generations, training loss, and preprocessing behavior.\n\n**Main pitfalls:**\n\n  * Preprocessing can dominate your runtime. A recent HF forum thread shows learners getting stuck on long tokenization and chunking steps in Colab. (Hugging Face Forums)\n  * Packing matters. The efficient pattern is to concatenate samples with EOS and then chunk, instead of wasting short remainders. HF’s course and example scripts use this logic. (GitHub)\n  * Naive perplexity evaluation is misleading for fixed-length models. HF recommends a sliding-window strategy. (Hugging Face)\n  * Official examples can still break or drift. HF course discussions document typos and deprecated API usage in the training section. (Hugging Face Forums)\n\n\n\n* * *\n\n## Track 3 — Learn serious pretraining engineering\n\n**Use this track when:**\nYou already understand the basics and have trained small models. Now the problem is **throughput, memory, parallelism, stability, and scaling**.\n\n**Goal:**\nUnderstand how real pretraining systems are organized.\n\n**Read in this order:**\n\n  1. **HF LLM Training Playbook**\nStart here for the overview. HF describes it as an open collection of implementation tips, tricks, and resources for training large language models. (GitHub)\n\n  2. **HF LLM Training Handbook**\nThen go deeper. 
HF explicitly says this is technical material for LLM training engineers and operators. (GitHub)\n\n  3. **Megatron-LM / Megatron Core**\nStudy this when you need large-scale distributed training concepts. NVIDIA describes Megatron-LM as a reference example for research teams and distributed training, and Megatron Core as the high-performance building blocks for large-scale generative AI training. (GitHub)\n\n  4. **HF Accelerate Megatron-LM guide**\nUseful bridge if you already know HF tooling and want to see how large-scale GPT pretraining connects to their example scripts. (Hugging Face)\n\n  5. **Nanotron**\nGood alternative if you want a simpler, flexible pretraining library that is still designed for speed and scale. (GitHub)\n\n  6. **Pythia**\nBest for studying training dynamics rather than only final checkpoints. The project provides models and many checkpoints specifically to support research into how LMs evolve across training and scale. (GitHub)\n\n\n\n\n**What Track 3 is really about:**\nNot “how to make a toy model run,” but “how to make training stable, efficient, parallel, debuggable, and reproducible.” That is exactly how HF positions its handbook/playbook material. (GitHub)\n\n**Main pitfalls:**\n\n  * Entering this track too early. These resources assume you already know the basics.\n  * Treating distributed training as the first thing to learn, instead of the last thing.\n  * Ignoring real-world friction. Even practical scaling libraries still have active bug reports and setup issues. (GitHub)\n\n\n\n* * *\n\n## Best default path for most people\n\nFor most people, the best order is:\n\n**Track 1 → Track 2 → selective parts of Track 3**\n\nThat sequence matches the way the resources themselves are split: the HF course teaches the workflow and a small from-scratch example, while the handbook/playbook and systems libraries target training engineers working on scale and operations. 
(Hugging Face)\n\n## Short version\n\n  * **Track 1** = understand transformers and causal LM training\n  * **Track 2** = train a small model yourself\n  * **Track 3** = learn real pretraining systems and scaling\n\n\n\nThe most common mistake is skipping Track 1, dabbling in Track 3, and never finishing Track 2.",
  "title": "Missing info in LLM Course?"
}
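
The record's text names two points that tend to confuse learners: packing (concatenate samples with EOS, then chunk) and label shifting (labels look identical to `input_ids`, and the shift happens inside the model). A minimal sketch of both in plain Python may help. It is not taken from the HF course or its scripts. The names `pack_and_chunk`, `shifted_pairs`, and `EOS_ID`, and the toy token ids, are hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, Tuple

EOS_ID = 0  # hypothetical end-of-sequence token id, an assumption for this sketch


def pack_and_chunk(
    samples: Iterable[List[int]], block_size: int, eos_id: int = EOS_ID
) -> List[List[int]]:
    """Concatenate tokenized samples, separated by EOS, then cut fixed-size blocks."""
    stream: List[int] = []
    for ids in samples:
        stream.extend(ids)
        stream.append(eos_id)  # mark the document boundary
    # Keep only whole blocks; the trailing partial block is dropped.
    usable = (len(stream) // block_size) * block_size
    return [stream[i:i + block_size] for i in range(0, usable, block_size)]


def shifted_pairs(input_ids: List[int]) -> List[Tuple[int, int]]:
    """Show what the internal shift amounts to: position t is scored against token t+1."""
    labels = list(input_ids)  # labels start out as a plain copy of input_ids
    return list(zip(input_ids[:-1], labels[1:]))


# Toy token ids standing in for three tokenized documents.
samples = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
blocks = pack_and_chunk(samples, block_size=4)
pairs = shifted_pairs(blocks[0])
```

With these toy inputs, every block is full-length and short documents share blocks instead of being padded or discarded, which is the efficiency argument for packing. `shifted_pairs` only makes visible the (input, target) alignment that a causal-LM loss uses after shifting; in an actual training setup the model would typically do this internally.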