{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreia2wesn7sjyw26slprhdwodveb5hjo6dwsmqh7ufughi6min7is5q",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mh36zqn6vjm2"
  },
  "path": "/t/qwen3-5-4b-loss-exploding/174057#post_5",
  "publishedAt": "2026-03-14T20:20:35.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "This is all I got!\n\nHappy Pi Day!  3.14159… and Einstein’s birthday too!\n\nLooking at your Qwen training explosion issue, here are some **additional insights** beyond what’s already been discussed:\n\n##  **Often Overlooked Issues:**\n\n### 1. **Sequence Length Truncation**\n\n\n    # Check if your long reasoning traces are getting brutally truncated\n    max_length = 1024  # This might be TOO SHORT for Claude/Gemini reasoning\n    # Try: 2048 or 4096 if memory allows\n\n    # Check truncation rate:\n    truncated = sum(1 for sample in dataset if len(tokenizer.encode(sample)) > max_length)\n    print(f\"Truncated samples: {truncated}/{len(dataset)} ({100*truncated/len(dataset):.1f}%)\")\n\n\n### 2. **Label Shift Alignment**\n\n\n    # Qwen might expect labels shifted by 1 position\n    # Verify your labels align with predictions correctly\n    input_ids = batch[\"input_ids\"]\n    labels = batch[\"labels\"]\n\n    # Check if labels are properly shifted\n    assert labels.shape == input_ids.shape, \"Shape mismatch!\"\n\n\n### 3. 
**Gradient Accumulation Gotcha**\n\n\n    # With gradient_accumulation_steps=4 and batch_size=1, your effective batch is 4.\n    # If the loss is averaged per micro-batch rather than over all tokens in the\n    # accumulated batch, short and long samples can end up weighted unevenly.\n    # Try: batch_size=2, accumulation_steps=2 (same effective batch) if memory allows\n\n\n## **Quick Diagnostic Script:**\n\n\n    def diagnose_training_batch(batch, tokenizer):\n        \"\"\"Run this on a few batches before training\"\"\"\n        labels = batch[\"labels\"]\n        input_ids = batch[\"input_ids\"]\n\n        print(\"=== BATCH DIAGNOSTICS ===\")\n        print(f\"Batch size: {labels.shape[0]}\")\n        print(f\"Sequence length: {labels.shape[1]}\")\n\n        # Check supervision ratio\n        supervised_tokens = (labels != -100).sum().item()\n        total_tokens = labels.numel()\n        print(f\"Supervision ratio: {supervised_tokens/total_tokens:.2%}\")\n\n        # Check for all -100 rows (no supervision!)\n        no_supervision = (labels == -100).all(dim=1).sum().item()\n        print(f\"Samples with NO supervision: {no_supervision}\")\n\n        # input_ids are integers, so NaN/Inf cannot occur there;\n        # check instead that every id falls inside the tokenizer vocabulary\n        out_of_vocab = ((input_ids < 0) | (input_ids >= len(tokenizer))).any().item()\n        print(f\"Token ids outside vocab: {out_of_vocab}\")\n\n        # Decode first supervised tokens\n        for i in range(min(2, labels.shape[0])):\n            kept = labels[i][labels[i] != -100]\n            if kept.numel() > 0:\n                print(f\"\\nSample {i} supervised text:\")\n                print(tokenizer.decode(kept[:50], skip_special_tokens=False))\n\n\n## **Qwen3.5-Specific Fix:**\n\nSince Qwen3.5 has **built-in thinking**, you might need to:\n\n\n    # Option 1: Strip thinking from your training data\n    def strip_thinking(response):\n        \"\"\"Remove <think> tags if present\"\"\"\n        import re\n        return re.sub(r'<think>.*?</think>', '', response, flags=re.DOTALL)\n\n    # Option 2: Or embrace it - keep thinking but format correctly\n    def format_for_qwen_thinking(user_msg, 
reasoning, final_answer):\n        return [\n            {\"role\": \"user\", \"content\": user_msg},\n            {\"role\": \"assistant\", \"content\": f\"<think>{reasoning}</think>{final_answer}\"}\n        ]\n\n\n## **Learning Rate Schedule Tweak:**\n\n\n    # Instead of just warmup_ratio=0.03, try:\n    warmup_steps = 100  # Fixed steps instead of ratio\n    lr_scheduler_type = \"cosine_with_restarts\"  # Alternative to linear decay; compare both on your run\n\n\n## **Emergency Stabilization:**\n\nIf it’s still exploding:\n\n\n    # Nuclear options:\n    max_grad_norm = 0.1  # More aggressive clipping\n    optim = \"adamw_torch_fused\"  # Fused AdamW kernel: faster, same update rule as adamw_torch\n    gradient_checkpointing = True  # Reduces memory pressure (not a stability fix by itself)\n\n\n**Want me to write a complete preprocessing pipeline** that:\n\n  1. Loads your shuffled Claude/Gemini dataset\n  2. Normalizes formatting to Qwen’s expectations\n  3. Applies proper chat templates\n  4. Validates supervision before training?\n\n\n\nJust say the word! And may your gradients be stable like π is constant!\n\nYes, that would be perfect. Do the best you can, Chat Qwen AI!\n\nYou got it, Eric!  
Let’s build you a **complete, battle-tested preprocessing pipeline** tailored for your shuffled Claude + Gemini reasoning dataset — designed to **stop the loss explosions** , align with **Qwen3.5’s “think-then-answer” template** , and give you clean, stable supervision tokens.\n\nWe’ll do this in 4 phases:\n\n* * *\n\n##  PHASE 1: LOAD & INSPECT YOUR DATASET\n\n\n    import json\n    import pandas as pd\n    from datasets import load_dataset\n    from transformers import AutoTokenizer\n\n    # Load your dataset (adjust path or source as needed)\n    # Example: if it's a JSONL file on Hugging Face Hub or local\n    dataset = load_dataset(\"json\", data_files=\"your_shuffled_reasoning_data.jsonl\", split=\"train\")\n\n    # OR if it's already a HF dataset:\n    # dataset = load_dataset(\"your_username/your_dataset_name\")\n\n    print(f\"Loaded {len(dataset)} samples\")\n    print(\"Sample raw entry:\")\n    print(dataset[0])\n\n\n* * *\n\n##  PHASE 2: NORMALIZE FORMAT TO QWEN’S EXPECTATIONS\n\nQwen3.5 expects:\n\n  * User message → Assistant thinks → Assistant answers\n  * But during fine-tuning, we usually want to supervise ONLY the final answer (or optionally include thinking if formatted correctly)\n\n\n\nWe’ll assume you want to train on **final answers only** (safest for stability). 
If you want to include reasoning, I’ll show you how later.\n\n\n    def normalize_to_qwen_format(example):\n        \"\"\"\n        Converts mixed Claude/Gemini outputs into Qwen-compatible chat format.\n        Assumes each sample has 'messages' key with list of {role, content} dicts.\n        We'll extract only the LAST assistant response as the target.\n        \"\"\"\n        messages = example.get(\"messages\", [])\n\n        # Keep only user and assistant turns (this drops system messages, if any)\n        user_msgs = [m for m in messages if m[\"role\"] == \"user\"]\n        assistant_msgs = [m for m in messages if m[\"role\"] == \"assistant\"]\n\n        # Take last user prompt and last assistant response\n        last_user = user_msgs[-1][\"content\"]\n        last_assistant = assistant_msgs[-1][\"content\"]\n\n        # Build Qwen-style chat template input\n        qwen_messages = [\n            {\"role\": \"user\", \"content\": last_user},\n            {\"role\": \"assistant\", \"content\": last_assistant}\n        ]\n\n        return {\n            \"chat_template_input\": qwen_messages,\n            \"target_text\": last_assistant  # For debugging/validation\n        }\n\n    def has_user_and_assistant(example):\n        \"\"\"True if the sample has at least one user and one assistant message.\"\"\"\n        roles = {m[\"role\"] for m in (example.get(\"messages\") or [])}\n        return {\"user\", \"assistant\"} <= roles\n\n    # Drop invalid samples first: returning None from map() does not remove a row\n    valid_dataset = dataset.filter(has_user_and_assistant)\n    normalized_dataset = valid_dataset.map(normalize_to_qwen_format, remove_columns=dataset.column_names)\n\n    print(f\"After normalization: {len(normalized_dataset)} valid samples\")\n    print(\"Normalized sample:\")\n    print(normalized_dataset[0])\n\n\n> _If you want to INCLUDE reasoning steps (e.g., “Let me think step by step…”), let me know. We can preserve them wrapped in `<think>` tags, as in the Qwen3.5-specific example earlier in this thread, subject to Qwen’s current spec._\n\n* * *\n\n## PHASE 3: APPLY CHAT TEMPLATE + TOKENIZE WITH PROPER MASKING\n\nThis is where most explosions happen: improper labeling or tokenization.\n\n\n    model_name = \"Qwen/Qwen2.5-7B-Instruct\"  # Placeholder from the original reply; use the Qwen3.5 checkpoint you are fine-tuning so the chat template matches\n    tokenizer = AutoTokenizer.from_pretrained(model_name)\n\n    # Ensure special tokens are set correctly\n    if tokenizer.pad_token is None:\n        tokenizer.pad_token = tokenizer.eos_token\n\n    def tokenize_and_mask(example):\n        messages = example[\"chat_template_input\"]\n\n        # Apply chat template: this adds all necessary special tokens\n        text = tokenizer.apply_chat_template(\n            messages,\n            tokenize=False,\n            add_generation_prompt=False  # We’re training, not generating\n        )\n\n        # Tokenize\n        enc = tokenizer(\n            text,\n            truncation=True,\n            max_length=1024,  # Adjust based on your VRAM\n            padding=False,\n            return_tensors=None\n        )\n\n        # Now create labels: mask everything except assistant response\n        # Find where assistant response starts\n        assistant_start_marker = \"assistant\\n\"\n        assistant_end_marker = \"\n",
  "title": "Qwen3.5-4B loss exploding"
}