Qwen3.5-4B loss exploding
This is all I got!
Happy Pi Day! 3.14159… and Einstein’s birthday too!
Looking at your Qwen training explosion issue, here are some additional insights beyond what’s already been discussed:
Often Overlooked Issues:
1. Sequence Length Truncation
# Check whether your long reasoning traces are being truncated
max_length = 1024  # Possibly too short for long Claude/Gemini reasoning traces
# Try 2048 or 4096 if memory allows
# Measure the truncation rate (this assumes each sample is a plain string;
# for dict rows, pass the relevant text field to tokenizer.encode):
truncated = sum(1 for sample in dataset if len(tokenizer.encode(sample)) > max_length)
print(f"Truncated samples: {truncated}/{len(dataset)} ({100*truncated/len(dataset):.1f}%)")
2. Label Shift Alignment
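# Hugging Face causal LMs (Qwen included) shift labels by one position inside
# the loss computation, so do NOT pre-shift them yourself.
# labels should line up with input_ids token for token, with -100 on masked positions.
input_ids = batch["input_ids"]
labels = batch["labels"]
# Shapes must match; this only rules out a gross mismatch, it does not prove alignment
assert labels.shape == input_ids.shape, "Shape mismatch!"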
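To make the alignment question concrete, here is a minimal pure-Python sketch of which (context token, target token) pairs end up being scored when the one-position shift happens inside the loss. The helper name `shifted_pairs` and the toy token IDs are made up for illustration; this is not the library's code, only the indexing it implies.

```python
IGNORE_INDEX = -100  # label value that the loss skips

def shifted_pairs(input_ids, labels):
    """Return the (context_token, target_token) pairs the loss would score.

    Position t predicts the label at position t + 1, which is the shift
    applied inside the loss when labels are aligned with input_ids.
    """
    assert len(input_ids) == len(labels), "Shape mismatch!"
    pairs = []
    for t in range(len(input_ids) - 1):
        target = labels[t + 1]
        if target != IGNORE_INDEX:
            pairs.append((input_ids[t], target))
    return pairs

# Toy example: 3 prompt tokens followed by 2 answer tokens
input_ids = [11, 12, 13, 21, 22]
aligned = [-100, -100, -100, 21, 22]      # masked prompt, labels NOT pre-shifted
pre_shifted = [-100, -100, 21, 22, -100]  # labels shifted by hand (wrong here)

print(shifted_pairs(input_ids, aligned))      # [(13, 21), (21, 22)]
print(shifted_pairs(input_ids, pre_shifted))  # [(12, 21), (13, 22)]
```

With aligned labels, the last prompt token predicts the first answer token, as intended. With hand-shifted labels every target is scored one position too early, which still passes a shape check.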
3. Gradient Accumulation Gotcha
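# With gradient_accumulation_steps=4 and batch_size=1, your effective batch is 4.
# Accumulation is not noisier in itself, but each micro-batch's loss is usually
# averaged over its own tokens, so a short sample and a long sample get equal weight.
# With very uneven sequence lengths this can differ from a true batch of 4.
# If memory allows, try batch_size=2, accumulation_steps=2 (same effective batch).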
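One concrete way accumulation with batch_size=1 can differ from a real batch is loss normalization when sequence lengths vary. The sketch below uses made-up per-token losses to compare averaging per micro-batch against averaging over all tokens. Whether your trainer does one or the other depends on the library and version, so treat this as an illustration rather than a description of any specific trainer.

```python
def mean(values):
    return sum(values) / len(values)

def accumulated_loss(micro_batches):
    """Mean of per-micro-batch means: every micro-batch gets equal weight."""
    return mean([mean(token_losses) for token_losses in micro_batches])

def token_level_loss(micro_batches):
    """Mean over every supervised token, as one large batch would compute it."""
    all_tokens = [loss for token_losses in micro_batches for loss in token_losses]
    return mean(all_tokens)

# Hypothetical per-token losses: a short sample with high loss, a long one with low loss
micro_batches = [[4.0, 4.0], [1.0] * 8]

print(accumulated_loss(micro_batches))  # 2.5
print(token_level_loss(micro_batches))  # 1.6
```

The short, high-loss sample dominates when each micro-batch is averaged separately, which can make the reported loss and the gradient scale jumpier than the token-level average.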
Quick Diagnostic Script:
import torch

def diagnose_training_batch(batch, tokenizer):
    """Run this on a few batches before training."""
    labels = batch["labels"]
    input_ids = batch["input_ids"]
    print("=== BATCH DIAGNOSTICS ===")
    print(f"Batch size: {labels.shape[0]}")
    print(f"Sequence length: {labels.shape[1]}")
    # Check supervision ratio
    supervised_tokens = (labels != -100).sum().item()
    total_tokens = labels.numel()
    print(f"Supervision ratio: {supervised_tokens/total_tokens:.2%}")
    # Check for all -100 rows (no supervision!)
    no_supervision = (labels == -100).all(dim=1).sum().item()
    print(f"Samples with NO supervision: {no_supervision}")
    # input_ids are integers and cannot be NaN/Inf; check the ID range instead
    print(f"Min input id: {input_ids.min().item()}, max input id: {input_ids.max().item()}")
    print(f"Tokenizer size: {len(tokenizer)}")
    # Decode the first supervised tokens
    for i in range(min(2, labels.shape[0])):
        kept = labels[i][labels[i] != -100]
        if kept.numel() > 0:
            print(f"\nSample {i} supervised text:")
            print(tokenizer.decode(kept[:50], skip_special_tokens=False))
Qwen3.5-Specific Fix:
Since Qwen3.5 has built-in thinking, you might need to:
# Option 1: Strip thinking from your training data
import re

def strip_thinking(response):
    """Remove <think>...</think> blocks if present."""
    return re.sub(r'<think>.*?</think>', '', response, flags=re.DOTALL)
# Option 2: Or embrace it - keep thinking but format correctly
def format_for_qwen_thinking(user_msg, reasoning, final_answer):
    return [
        {"role": "user", "content": user_msg},
        {"role": "assistant", "content": f"<think>{reasoning}</think>{final_answer}"}
    ]
Learning Rate Schedule Tweak:
# Instead of just warmup_ratio=0.03, try:
warmup_steps = 100  # Fixed step count instead of a ratio
lr_scheduler_type = "cosine_with_restarts"  # Smoother decay than linear; each restart raises the LR again, so use plain "cosine" if spikes follow a restart
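To see what a fixed warmup buys you, here is a small sketch of linear warmup followed by a single cosine decay, written from the usual formula rather than copied from any library. The function name `lr_at`, the peak learning rate of 2e-5, and the 1000-step run are assumptions for illustration. For comparison, warmup_ratio=0.03 on a 1000-step run gives only about 30 warmup steps.

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_steps=100):
    """Linear warmup to peak_lr, then cosine decay to zero (no restarts)."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)  # clamp in case step exceeds total_steps
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Print the schedule at a few checkpoints of a hypothetical 1000-step run
for step in (0, 50, 100, 550, 1000):
    print(step, f"{lr_at(step, total_steps=1000, peak_lr=2e-5):.2e}")
```

If the loss spikes within the first few dozen steps, printing the schedule like this is a quick way to check whether the learning rate reached its peak before the optimizer statistics had settled.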
Emergency Stabilization:
If it’s still exploding:
# Nuclear options:
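max_grad_norm = 0.1  # More aggressive clipping (the Trainer default is 1.0)
optim = "adamw_torch_fused"  # Fused AdamW kernel: faster, same update rule; not a stability fix by itself
gradient_checkpointing = True  # Saves memory (room for longer sequences or bigger batches); does not change the loss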
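For intuition about what max_grad_norm controls, this is a minimal pure-Python sketch of clipping by global norm: all gradients are scaled by one shared factor so their joint L2 norm does not exceed the threshold. The function name and the toy gradient values are made up; real training code would use the framework's own clipping utility.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale every gradient by the same factor so the joint L2 norm is <= max_norm.

    grads is a list of per-parameter gradient vectors (plain lists of floats).
    Returns the clipped gradients and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm

# Toy gradients for two parameters; joint norm is sqrt(9 + 16 + 144) = 13
grads = [[3.0, 4.0], [0.0, 12.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, norm_after)
```

Logging the pre-clip norm is usually more informative than the clipped one: if it sits far above the threshold on nearly every step, clipping is masking a data or learning-rate problem rather than fixing it.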
Want me to write a complete preprocessing pipeline that:
- Loads your shuffled Claude/Gemini dataset
- Normalizes formatting to Qwen’s expectations
- Applies proper chat templates
- Validates supervision before training?
Just say the word! And may your gradients be stable like π is constant!
Yes, that would be perfect. Do the best you can, Chat Qwen AI!
You got it, Eric! Let’s build a complete preprocessing pipeline tailored to your shuffled Claude + Gemini reasoning dataset, designed to stop the loss explosions, align with Qwen3.5’s “think-then-answer” template, and give you clean, stable supervision tokens.
We’ll do this in 4 phases:
PHASE 1: LOAD & INSPECT YOUR DATASET
import json
import pandas as pd
from datasets import load_dataset
from transformers import AutoTokenizer
# Load your dataset (adjust path or source as needed)
# Example: if it's a JSONL file on Hugging Face Hub or local
dataset = load_dataset("json", data_files="your_shuffled_reasoning_data.jsonl", split="train")
# OR if it's already a HF dataset:
# dataset = load_dataset("your_username/your_dataset_name")
print(f"Loaded {len(dataset)} samples")
print("Sample raw entry:")
print(dataset[0])
PHASE 2: NORMALIZE FORMAT TO QWEN’S EXPECTATIONS
Qwen3.5 expects:
- User message → Assistant thinks → Assistant answers
- But during fine-tuning, we usually want to supervise ONLY the final answer (or optionally include thinking if formatted correctly)
We’ll assume you want to train on final answers only (safest for stability). If you want to include reasoning, I’ll show you how later.
def has_user_and_assistant(example):
    """Keep only samples with at least one user and one assistant message."""
    roles = {m["role"] for m in example.get("messages") or []}
    return "user" in roles and "assistant" in roles

def normalize_to_qwen_format(example):
    """
    Converts mixed Claude/Gemini outputs into Qwen-compatible chat format.
    Assumes each sample has a 'messages' key with a list of {role, content} dicts.
    Only the LAST user prompt and the LAST assistant response are kept.
    """
    messages = example["messages"]
    # Collect user and assistant turns (system messages are dropped)
    user_msgs = [m for m in messages if m["role"] == "user"]
    assistant_msgs = [m for m in messages if m["role"] == "assistant"]
    # Take last user prompt and last assistant response
    last_user = user_msgs[-1]["content"]
    last_assistant = assistant_msgs[-1]["content"]
    # Build Qwen-style chat template input
    qwen_messages = [
        {"role": "user", "content": last_user},
        {"role": "assistant", "content": last_assistant}
    ]
    return {
        "chat_template_input": qwen_messages,
        "target_text": last_assistant  # For debugging/validation
    }
# Drop invalid samples BEFORE mapping: returning None from a map function does
# not remove the row, and a later "x is not None" filter never fires because
# every row reaches the filter as a dict
valid_dataset = dataset.filter(has_user_and_assistant)
normalized_dataset = valid_dataset.map(normalize_to_qwen_format, remove_columns=valid_dataset.column_names)
print(f"After normalization: {len(normalized_dataset)} valid samples")
print("Normalized sample:")
print(normalized_dataset[0])
If you want to INCLUDE reasoning steps (e.g., “Let me think step by step…”), let me know: we can preserve them by wrapping them in <think> tags, as in the format_for_qwen_thinking example earlier, depending on Qwen’s current spec.
PHASE 3: APPLY CHAT TEMPLATE + TOKENIZE WITH PROPER MASKING
This is a common place for explosions to start: improper labeling or tokenization.
model_name = "Qwen/Qwen2.5-7B-Instruct"  # Placeholder: use the exact checkpoint you are fine-tuning so the chat template matches
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Ensure a pad token is set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
def tokenize_and_mask(example):
    messages = example["chat_template_input"]
    # Apply the chat template; this adds all necessary special tokens
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=False  # We're training, not generating
    )
    # Tokenize
    enc = tokenizer(
        text,
        truncation=True,
        max_length=1024,  # Adjust based on your VRAM; 1024 may truncate long reasoning traces
        padding=False,
        return_tensors=None
    )
    # Now create labels: mask everything except assistant response
    # Find where assistant response starts
    assistant_start_marker = "assistant\n"
    assistant_end_marker = "
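As a separate illustration of the masking goal stated in the comments (supervise only the assistant response), here is a minimal sketch that masks by prompt length instead of searching for marker strings. It is not the continuation of the listing above. It assumes you tokenize the conversation twice: once without the final assistant turn (with the generation prompt added) to get `prompt_ids`, and once in full to get `full_ids`. The function names and toy IDs are hypothetical, and the prefix property should be verified for your tokenizer and template.

```python
IGNORE_INDEX = -100  # label value that the loss skips

def build_labels(prompt_ids, full_ids, max_length):
    """Mask the prompt so only assistant-response tokens contribute to the loss."""
    if full_ids[:len(prompt_ids)] != prompt_ids:
        raise ValueError("prompt is not a token-level prefix of the full sequence")
    input_ids = full_ids[:max_length]  # truncate on the right
    n_masked = min(len(prompt_ids), len(input_ids))
    labels = [IGNORE_INDEX] * n_masked + input_ids[len(prompt_ids):]
    return input_ids, labels

def has_supervision(labels):
    """False when truncation or masking left nothing to learn from."""
    return any(token != IGNORE_INDEX for token in labels)

# Toy IDs: 3 prompt tokens followed by 3 response tokens
prompt_ids = [1, 2, 3]
full_ids = [1, 2, 3, 7, 8, 9]

input_ids, labels = build_labels(prompt_ids, full_ids, max_length=5)
print(input_ids, labels)        # [1, 2, 3, 7, 8] [-100, -100, -100, 7, 8]

_, labels_short = build_labels(prompt_ids, full_ids, max_length=3)
print(has_supervision(labels_short))  # False
```

Dropping samples for which `has_supervision` is false avoids batches where every label is -100, a case the diagnostic script earlier in the thread also reports.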