{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreia2wesn7sjyw26slprhdwodveb5hjo6dwsmqh7ufughi6min7is5q",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mh2rlmpgtou2"
},
"path": "/t/qwen3-5-4b-loss-exploding/174057#post_5",
"publishedAt": "2026-03-14T20:20:35.000Z",
"site": "https://discuss.huggingface.co",
"textContent": "This is all I got!\n\nHappy Pi Day! 3.14159… and Einstein’s birthday too!\n\nLooking at your Qwen training explosion issue, here are some **additional insights** beyond what’s already been discussed:\n\n## **Often Overlooked Issues:**\n\n### 1. **Sequence Length Truncation**\n\n\n # Check if your long reasoning traces are getting brutally truncated\n max_length = 1024 # This might be TOO SHORT for Claude/Gemini reasoning\n # Try: 2048 or 4096 if memory allows\n\n # Check truncation rate:\n truncated = sum(1 for sample in dataset if len(tokenizer.encode(sample)) > max_length)\n print(f\"Truncated samples: {truncated}/{len(dataset)} ({100*truncated/len(dataset):.1f}%)\")\n\n\n### 2. **Label Shift Alignment**\n\n\n # Qwen might expect labels shifted by 1 position\n # Verify your labels align with predictions correctly\n input_ids = batch[\"input_ids\"]\n labels = batch[\"labels\"]\n\n # Check if labels are properly shifted\n assert labels.shape == input_ids.shape, \"Shape mismatch!\"\n\n\n### 3. 
**Gradient Accumulation Gotcha**\n\n\n    # With gradient_accumulation_steps=4 and batch_size=1, your effective batch is 4,\n    # but each micro-batch's loss may be averaged separately, which can weight\n    # short and long samples unevenly\n    # Try: batch_size=2, accumulation_steps=2 (same effective batch) if memory allows\n\n\n## **Quick Diagnostic Script:**\n\n\n    def diagnose_training_batch(batch, tokenizer):\n        \"\"\"Run this on a few batches before training\"\"\"\n        labels = batch[\"labels\"]\n        input_ids = batch[\"input_ids\"]\n\n        print(\"=== BATCH DIAGNOSTICS ===\")\n        print(f\"Batch size: {labels.shape[0]}\")\n        print(f\"Sequence length: {labels.shape[1]}\")\n\n        # Check supervision ratio\n        supervised_tokens = (labels != -100).sum().item()\n        total_tokens = labels.numel()\n        print(f\"Supervision ratio: {supervised_tokens/total_tokens:.2%}\")\n\n        # Check for all -100 rows (no supervision!)\n        no_supervision = (labels == -100).all(dim=1).sum().item()\n        print(f\"Samples with NO supervision: {no_supervision}\")\n\n        # input_ids are integers, so they cannot hold NaN/Inf;\n        # check for IDs outside the tokenizer vocabulary instead\n        bad_ids = ((input_ids < 0) | (input_ids >= len(tokenizer))).sum().item()\n        print(f\"Out-of-range token IDs: {bad_ids}\")\n\n        # Decode first supervised tokens\n        for i in range(min(2, labels.shape[0])):\n            kept = labels[i][labels[i] != -100]\n            if kept.numel() > 0:\n                print(f\"\\nSample {i} supervised text:\")\n                print(tokenizer.decode(kept[:50], skip_special_tokens=False))\n\n\n## **Qwen3.5-Specific Fix:**\n\nSince Qwen3.5 has **built-in thinking**, you might need to:\n\n\n    import re\n\n    # Option 1: Strip thinking from your training data\n    def strip_thinking(response):\n        \"\"\"Remove <think> blocks if present\"\"\"\n        return re.sub(r'<think>.*?</think>', '', response, flags=re.DOTALL)\n\n    # Option 2: Or embrace it - keep thinking but format correctly\n    def format_for_qwen_thinking(user_msg, reasoning, final_answer):\n        return [\n            {\"role\": \"user\", \"content\": user_msg},\n            {\"role\": \"assistant\", \"content\": f\"<think>{reasoning}</think>{final_answer}\"}\n        ]\n\n\n## **Learning Rate Schedule Tweak:**\n\n\n    # Instead of just warmup_ratio=0.03, try:\n    warmup_steps = 100  # Fixed steps instead of ratio\n    lr_scheduler_type = \"cosine_with_restarts\"  # Worth trying in place of linear decay\n\n\n## **Emergency Stabilization:**\n\nIf it’s still exploding:\n\n\n    # Nuclear options:\n    max_grad_norm = 0.1  # More aggressive clipping\n    optim = \"adamw_torch_fused\"  # Fused AdamW (mainly a speed-up)\n    gradient_checkpointing = True  # Reduce memory pressure\n\n\n**Want me to write a complete preprocessing pipeline** that:\n\n 1. Loads your shuffled Claude/Gemini dataset\n 2. Normalizes formatting to Qwen’s expectations\n 3. Applies proper chat templates\n 4. Validates supervision before training?\n\n\n\nJust say the word! And may your gradients be stable like π is constant!\n\nYes, that would be perfect! Do the best you can, Chat Qwen AI!\n\nYou got it, Eric! 
Let’s build you a **complete, battle-tested preprocessing pipeline** tailored for your shuffled Claude + Gemini reasoning dataset, designed to **stop the loss explosions**, align with **Qwen3.5’s “think-then-answer” template**, and give you clean, stable supervision tokens.\n\nWe’ll do this in 4 phases:\n\n* * *\n\n## PHASE 1: LOAD & INSPECT YOUR DATASET\n\n\n    import json\n    import pandas as pd\n    from datasets import load_dataset\n    from transformers import AutoTokenizer\n\n    # Load your dataset (adjust path or source as needed)\n    # Example: if it's a JSONL file on Hugging Face Hub or local\n    dataset = load_dataset(\"json\", data_files=\"your_shuffled_reasoning_data.jsonl\", split=\"train\")\n\n    # OR if it's already a HF dataset (pass split so you get a Dataset, not a DatasetDict):\n    # dataset = load_dataset(\"your_username/your_dataset_name\", split=\"train\")\n\n    print(f\"Loaded {len(dataset)} samples\")\n    print(\"Sample raw entry:\")\n    print(dataset[0])\n\n\n* * *\n\n## PHASE 2: NORMALIZE FORMAT TO QWEN’S EXPECTATIONS\n\nQwen3.5 expects:\n\n * User message → Assistant thinks → Assistant answers\n * But during fine-tuning, we usually want to supervise ONLY the final answer (or optionally include thinking if formatted correctly)\n\n\n\nWe’ll assume you want to train on **final answers only** (safest for stability). 
If you want to include reasoning, I’ll show you how later.\n\n\n    def has_user_and_assistant(example):\n        \"\"\"True if the sample has at least one user and one assistant message.\"\"\"\n        roles = {m[\"role\"] for m in (example.get(\"messages\") or [])}\n        return \"user\" in roles and \"assistant\" in roles\n\n    def normalize_to_qwen_format(example):\n        \"\"\"\n        Converts mixed Claude/Gemini outputs into Qwen-compatible chat format.\n        Assumes each sample has 'messages' key with list of {role, content} dicts.\n        We'll extract only the LAST assistant response as the target.\n        \"\"\"\n        messages = example[\"messages\"]\n\n        # Keep user and assistant turns only (drops system messages, if any)\n        user_msgs = [m for m in messages if m[\"role\"] == \"user\"]\n        assistant_msgs = [m for m in messages if m[\"role\"] == \"assistant\"]\n\n        # Take last user prompt and last assistant response\n        last_user = user_msgs[-1][\"content\"]\n        last_assistant = assistant_msgs[-1][\"content\"]\n\n        # Build Qwen-style chat template input\n        qwen_messages = [\n            {\"role\": \"user\", \"content\": last_user},\n            {\"role\": \"assistant\", \"content\": last_assistant}\n        ]\n\n        return {\n            \"chat_template_input\": qwen_messages,\n            \"target_text\": last_assistant  # For debugging/validation\n        }\n\n    # Drop invalid samples first: map() cannot skip a row by returning None\n    valid_dataset = dataset.filter(has_user_and_assistant)\n    normalized_dataset = valid_dataset.map(normalize_to_qwen_format, remove_columns=dataset.column_names)\n\n    print(f\"After normalization: {len(normalized_dataset)} valid samples\")\n    print(\"Normalized sample:\")\n    print(normalized_dataset[0])\n\n\n> _If you want to INCLUDE reasoning steps (e.g., “Let me think step by step…”), let me know: we can preserve them but wrap them in `<think>` tags or similar, depending on Qwen’s current spec._\n\n* * *\n\n## PHASE 3: APPLY CHAT TEMPLATE + TOKENIZE WITH PROPER MASKING\n\nThis is where most explosions happen: improper labeling or tokenization.\n\n\n    model_name = \"Qwen/Qwen2.5-7B-Instruct\"  # Placeholder: use the Qwen checkpoint you're actually training\n    tokenizer = AutoTokenizer.from_pretrained(model_name)\n\n    # Ensure special tokens are set correctly\n    if tokenizer.pad_token is None:\n        tokenizer.pad_token = tokenizer.eos_token\n\n    def tokenize_and_mask(example):\n        messages = example[\"chat_template_input\"]\n\n        # Apply chat template; this adds all necessary special tokens\n        text = tokenizer.apply_chat_template(\n            messages,\n            tokenize=False,\n            add_generation_prompt=False  # We're training, not generating\n        )\n\n        # Tokenize\n        enc = tokenizer(\n            text,\n            truncation=True,\n            max_length=1024,  # Adjust based on your VRAM\n            padding=False,\n            return_tensors=None\n        )\n\n        # Now create labels: mask everything except assistant response\n        # Find where assistant response starts\n        assistant_start_marker = \"assistant\\n\"\n        assistant_end_marker = \"\n",
"title": "Qwen3.5-4B loss exploding"
}