Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiebrjukrphgeafiislto23ieww53fkwrfe57phrisem3pgbs2nnwm",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mghv5zedni32"
  },
  "path": "/t/qwen3-5-4b-loss-exploding/174057#post_2",
  "publishedAt": "2026-03-07T12:43:56.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "GitHub",
    "Hugging Face",
    "Qwen",
    "PyTorch Docs"
  ],
  "textContent": "When it comes to failures specific to fine-tuning newer Qwen models like Qwen 3.5, `<think>`-related issues are the first to be suspected.\n\n* * *\n\n## My diagnosis\n\nThis looks much more like a **formatting / masking / tokenizer / precision** problem than a pure learning-rate problem.\n\nWhy: in the curves you shared, training is usable for a while, then there is a **sharp regime change** : loss jumps, grad norm becomes erratic, and the run degrades even though LR is already past warmup and moving down. That pattern matches “a subset of batches is pathological” better than “the global step size is too high.” TRL’s own docs also note that adapter training commonly uses a **higher** LR around `1e-4`, so a run failing at `1e-5` or `2e-5` is a strong hint that something else is wrong first. (GitHub)\n\n* * *\n\n## The most likely causes, ranked\n\n### 1) Your data is probably mismatched to how Qwen3.5 expects reasoning to look\n\nQwen3.5 is not just a generic chat model. Its current docs say it **thinks by default before responding** , and direct non-thinking responses are obtained by explicitly disabling thinking in the chat-template/API configuration. The Qwen docs and model card also recommend standardized formatting for outputs, and for Qwen3 they explicitly note that historical turns should keep only the **final output** , not the thinking content, unless your framework is handling that correctly for you. (Hugging Face)\n\nThat matters because your training data is a shuffled mixture of reasoning-heavy outputs from different teacher families. 
Even if the content quality is high, the **target format** is likely inconsistent:\n\n  * different reasoning style\n  * different boundary between reasoning and final answer\n  * some samples may be answer-only\n  * some may be long enough that the useful supervised region gets truncated\n  * some may include reasoning patterns that do not align with Qwen’s template assumptions\n\n\n\nOnline Qwen training guidance does **not** recommend casually mixing arbitrary reasoning traces. Qwen’s own training docs say that if you fine-tune with data **without chain of thought** but want to preserve reasoning ability, you should handle it explicitly with things like `ignore_empty_think` or a non-thinking prefix / instruction, rather than letting formats mix implicitly. The ms-swift Qwen3.5 examples also use `add_non_thinking_prefix`, `ignore_empty_think`, `bfloat16`, `max_length 2048`, `warmup_ratio 0.05`, and a LoRA LR of `1e-4`, which reinforces that the recommended baseline is a **controlled format**, not raw mixed reasoning dumps. (Qwen)\n\n**My view:** this is the highest-probability root cause in your case.\n\n* * *\n\n### 2) Assistant-only masking or truncation is probably breaking supervision on long samples\n\nThis is one of the closest-matching known failure modes.\n\nTRL’s SFT docs say `assistant_only_loss=True` only works for templates that can return the assistant token mask correctly. They also document that truncation matters, and a recent TRL issue shows a concrete failure: when assistant tokens occur only **after** `max_length`, `assistant_masks` can become all zeros, which leads to labels that are entirely `-100`. 
(GitHub)\n\nThat maps directly onto your setup:\n\n  * reasoning-heavy teacher outputs are long\n  * long samples are more likely to push the actual assistant answer past `max_length`\n  * once that happens, some batches have almost no meaningful supervision\n  * those batches produce nonsense gradients or highly erratic updates\n\n\n\nThis is exactly the kind of thing that lowering LR does **not** fix. It just delays when the optimizer encounters enough bad batches to visibly break.\n\n* * *\n\n### 3) Chat template / EOS handling is a very strong suspect for Qwen-family SFT\n\nTransformers’ chat-template docs are explicit: chat templates already include the necessary special tokens, and if you format with `apply_chat_template(tokenize=False)` and then tokenize again with `add_special_tokens=True`, you can accidentally **duplicate** BOS/EOS/control tokens and hurt performance. They specifically say `apply_chat_template(tokenize=True)` is often safer for that reason. (Hugging Face)\n\nThere is also a TRL issue specific to Qwen where `_prepare_dataset()` appended an **extra EOS token** for Qwen chat formatting, creating endings like `<|im_end|>\\n<|im_end|>`. That is exactly the kind of subtle corruption that does not always fail immediately, but can create unstable late-training behavior. 
(GitHub)\n\nSo if your pipeline is doing any of the following, it is dangerous:\n\n  * applying a chat template manually and then letting the trainer apply one again\n  * formatting text first, then tokenizing later with `add_special_tokens=True`\n  * mixing teacher-formatted strings with model-native chat formatting\n  * manually appending EOS / turn-end markers on top of a tokenizer that already does it\n\n\n\nThis is a top-tier suspect.\n\n* * *\n\n### 4) If you are on a Qwen 4-bit stack, a wrong pad token can literally cause exploding gradients\n\nThere is a recent Qwen-specific bug report showing that some 4-bit tokenizers used the wrong pad token: `'<|vision_pad|>'` instead of `'<|endoftext|>'`. The report says this caused **NaN gradients** and exploding training when padding was present, especially with batch size greater than 1. (GitHub)\n\nThis is not my first guess if your true microbatch is always 1 and packing is off. But it becomes highly relevant if any of these are true:\n\n  * packing is on\n  * your collator still pads aggressively\n  * your “batch size 1” screenshot does not match the actual runtime\n  * some wrapper in your stack mutates the tokenizer config\n\n\n\nThis check is cheap and important.\n\n* * *\n\n### 5) FP16 and 8-bit optimizer choices may be amplifying the problem\n\nPyTorch’s AMP docs warn that fp16 does **not** work for every model, and explicitly note that many bf16-pretrained models cannot operate safely in fp16’s much smaller numerical range, which can cause gradient overflow. PyTorch and Transformers both point out that bf16 has a much larger dynamic range than fp16 and is generally the safer mixed-precision mode when hardware supports it. (PyTorch Docs)\n\nbitsandbytes’ docs say 8-bit optimizers are most beneficial when memory pressure comes from **many trainable parameters**, and recommend `StableEmbedding` for NLP stability. 
In a LoRA setup, you are only optimizing a relatively small adapter set, so the upside of `adamw_8bit` is usually smaller than in full-parameter training. That makes it a poor choice for **debugging** stability, because it adds another quantized component without buying you much memory in return. (Hugging Face)\n\nSo I would treat precision and optimizer as **amplifiers**, not the root cause:\n\n  * bad batch or bad masking creates ugly gradients\n  * fp16 / 8-bit optimizer makes the ugliness more visible\n  * the visible symptom becomes “loss explosion”\n\n\n\n* * *\n\n### 6) Added special tokens can break LoRA training unless embeddings are also trainable\n\nThe Qwen repo explicitly warns that if your training introduces **new special tokens** during LoRA fine-tuning, you need to make the relevant layers trainable via `modules_to_save`; otherwise the model may not learn those tokens properly. (GitHub)\n\nThis matters if you introduced any custom markers such as:\n\n  * `<analysis>`\n  * `<reasoning>`\n  * custom teacher separators\n  * synthetic `<final>` tags\n  * any delimiter not already native to the checkpoint/tokenizer\n\n\n\nIf you did, remove them first for debugging, or train the relevant embeddings/output layers correctly.\n\n* * *\n\n## What I think is happening in **your** case\n\nMy best current explanation is:\n\n> a subset of your reasoning-heavy mixed dataset is being converted into a Qwen3.5 training sequence incorrectly, and when those malformed or weakly supervised examples hit, gradients spike; lower LR only postpones that encounter.\n\nThat explanation fits:\n\n  * the shape of the curves you shared\n  * known TRL masking/truncation behavior\n  * known Qwen chat-template pitfalls\n  * known Qwen 4-bit pad-token bugs\n  * the fact that reducing LR delays rather than cures the failure (GitHub)\n\n\n\n* * *\n\n## What I would do, in order\n\n### 1) Run the most boring possible debug configuration\n\nUse this first:\n\n\n    per_device_train_batch_size = 1\n  
  gradient_accumulation_steps = 4  # or 8\n    learning_rate = 5e-5\n    warmup_ratio = 0.03\n    max_grad_norm = 0.5\n    weight_decay = 0.0\n\n    packing = False\n    assistant_only_loss = False\n    completion_only_loss = False  # if applicable\n    group_by_length = False\n    max_length = 1024  # maybe 2048 later\n\n    optim = \"adamw_torch\"\n    bf16 = True   # if hardware supports it\n    fp16 = False\n\n\nWhy this setup:\n\n  * `assistant_only_loss=False` removes one major masking failure mode while debugging\n  * `packing=False` removes packed-sequence boundary problems\n  * plain AdamW removes 8-bit optimizer noise\n  * bf16 reduces overflow risk\n  * shorter context reduces truncation pressure\n  * `5e-5` is conservative but still reasonable for LoRA SFT once the data path is correct (GitHub)\n\n\n\nDo **not** optimize for speed yet. Optimize for interpretability.\n\n* * *\n\n### 2) Overfit a tiny, hand-cleaned subset\n\nTake 64–128 examples and inspect them manually.\n\nKeep only rows that have:\n\n  * one clean user turn\n  * one clean assistant turn\n  * no duplicate special tokens\n  * no weird teacher artifacts\n  * no ultra-long rambling reasoning block\n  * no custom tokens unless you truly need them\n\n\n\nIf this subset trains cleanly, your framework is probably fine and the larger dataset contains toxic rows. If this subset still blows up, the problem is more likely tokenizer / template / precision. That is the fastest split between “data problem” and “stack problem.”\n\n* * *\n\n### 3) Normalize the dataset into **one mode**\n\nDo **not** train on a random soup of teacher traces.\n\nPick one:\n\n#### Mode A: thinking training\n\nNormalize every assistant response into one consistent structure, for example:\n\n  * reasoning block\n  * final answer\n\n\n\n#### Mode B: non-thinking training\n\nStrip chain-of-thought and keep only the final answer.\n\nThis recommendation follows directly from Qwen’s own training guidance. 
Their docs say that if you fine-tune with data that lacks chain-of-thought but want to preserve reasoning ability, you should handle that explicitly with `ignore_empty_think` or a non-thinking instruction/prefix. The Qwen3.5 examples similarly use `add_non_thinking_prefix` and `ignore_empty_think` in the fine-tuning recipe. (Qwen)\n\nFor your use case, I would personally start with **non-thinking training** first. Mixed external reasoning traces are a harder target to get right.\n\n* * *\n\n### 4) Inspect supervision density on real batches\n\nFor several batches, print:\n\n  * total sequence length\n  * number of labels not equal to `-100`\n  * first/last supervised tokens after masking\n  * whether the assistant answer survives truncation\n\n\n\nExample:\n\n\n    batch = next(iter(trainer.get_train_dataloader()))\n    labels = batch[\"labels\"]\n    counts = (labels != -100).sum(dim=1)\n    print(\"supervised token counts:\", counts.tolist())\n\n    for i in range(min(4, labels.size(0))):\n        kept = labels[i][labels[i] != -100]\n        print(f\"\\nExample {i}: {kept.numel()} supervised tokens\")\n        if kept.numel():\n            print(tokenizer.decode(kept[:120], skip_special_tokens=False))\n\n\nIf you see examples with almost no supervised tokens, or only fragments of a reasoning scaffold, you likely found the trigger. This is exactly the family of failure described in the TRL masking/truncation issue. 
(GitHub)\n\n* * *\n\n### 5) Verify the tokenizer configuration explicitly\n\nPrint this once at startup:\n\n\n    print(\"pad_token:\", tokenizer.pad_token, tokenizer.pad_token_id)\n    print(\"eos_token:\", tokenizer.eos_token, tokenizer.eos_token_id)\n    print(\"bos_token:\", tokenizer.bos_token, tokenizer.bos_token_id)\n    print(\"model pad_token_id:\", model.config.pad_token_id)\n    print(\"model eos_token_id:\", model.config.eos_token_id)\n    print(\"special_tokens_map:\", tokenizer.special_tokens_map)\n\n\nWhat to look for:\n\n  * wrong pad token, especially `'<|vision_pad|>'`\n  * unexpected EOS / turn-end behavior\n  * duplicated or custom special tokens you forgot about\n\n\n\nThe pad-token check is especially important if you are using a recent Qwen 4-bit stack. (GitHub)\n\n* * *\n\n### 6) Verify you are not double-applying the chat template\n\nBad pattern:\n\n\n    text = tokenizer.apply_chat_template(messages, tokenize=False)\n    # the template already inserted special tokens; this can add them again\n    enc = tokenizer(text, add_special_tokens=True, return_tensors=\"pt\")\n\n\nSafer patterns:\n\n\n    enc = tokenizer.apply_chat_template(messages, tokenize=True, return_tensors=\"pt\")\n\n\nor\n\n\n    text = tokenizer.apply_chat_template(messages, tokenize=False)\n    enc = tokenizer(text, add_special_tokens=False, return_tensors=\"pt\")\n\n\nTransformers’ docs are explicit that adding special tokens again after chat templating can duplicate them and hurt training. (Hugging Face)\n\n* * *\n\n### 7) Temporarily stop using assistant-only loss\n\nWhile debugging, turn it off.\n\nReason: TRL clearly states that assistant-only loss depends on chat templates that support returning assistant masks correctly, and we already know truncation can silently zero out those masks. Until you verify the data path, the extra selectivity is not worth the risk. 
(GitHub)\n\nOnce the boring run is stable, re-enable it.\n\n* * *\n\n### 8) If you added any special tokens, either remove them or train embeddings properly\n\nIf your preprocessing added new delimiters, either:\n\n  * remove them completely for the first stable run, or\n  * make the needed embedding/output layers trainable as Qwen recommends\n\n\n\nThis is a less common cause than masking/template issues, but when it is present it can quietly poison training. (GitHub)\n\n* * *\n\n## A practical decision tree\n\n### If the run blows up at roughly the **same step** every time\n\nThat usually means a **specific toxic region of the dataset**.\n\nAction:\n\n  * log dataset indices per batch\n  * dump the batch right before the failure\n  * decode the exact formatted text and labels\n\n\n\n### If it blows up at **different steps** each run\n\nThat leans more toward a **numerical issue**:\n\n  * fp16 overflow\n  * wrong pad token\n  * 8-bit optimizer sensitivity\n  * packing/padding nondeterminism\n\n\n\nAction:\n\n  * switch to bf16\n  * switch to plain AdamW\n  * disable packing\n  * keep batch size at 1\n  * re-check tokenizer special tokens\n\n\n\n* * *\n\n## What I would **not** focus on first\n\nI would **not** spend the first day sweeping LR from `1e-4` to `1e-6`.\n\nWhy not: both TRL and Qwen’s own example configs show that LoRA SFT commonly lives around `1e-4` once the data/template path is correct. If your run is exploding even after dropping well below that, LR is more likely the **volume knob** than the broken component. (GitHub)\n\n* * *\n\n## My probability ranking\n\nFor your case, I would rank the causes like this:\n\n  1. **dataset format mismatch / inconsistent teacher reasoning traces**\n  2. **assistant masking or truncation removing the actual supervised target**\n  3. **double chat templating / duplicate special tokens / extra EOS**\n  4. **wrong pad token in the 4-bit stack**\n  5. **fp16 or 8-bit optimizer amplifying the above**\n  6. 
**raw LR as the primary cause**\n\n\n\n* * *\n\n## The shortest version of my advice\n\nStart with this exact sequence:\n\n  1. **bf16 + plain AdamW + no packing + no assistant-only loss**\n\n  2. **tiny hand-cleaned subset**\n\n  3. **inspect labels and non-`-100` counts**\n\n  4. **verify pad/EOS/chat-template behavior**\n\n  5. **normalize the dataset into one consistent thinking or non-thinking format**\n\n  6. only then bring back:\n\n     * longer context\n     * assistant-only loss\n     * 8-bit optimizer\n     * the full mixed dataset\n\n\n\nThat is the path most likely to turn your current “explosion” into a reproducible, debuggable failure.",
  "title": "Qwen3.5-4B loss exploding"
}