External Publication

Qwen3.5-4B loss exploding

Hugging Face Forums [Unofficial] March 14, 2026

This is all I got!

Happy Pi Day! 3.14159… and Einstein’s birthday too!

Looking at your Qwen training explosion issue, here are some additional insights beyond what’s already been discussed:

Often Overlooked Issues:

1. Sequence Length Truncation

# Check if your long reasoning traces are getting brutally truncated
max_length = 1024  # This might be TOO SHORT for Claude/Gemini reasoning
# Try: 2048 or 4096 if memory allows

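# Check truncation rate (assumes each sample is a plain string;
# for chat-format rows, tokenize the text produced by the chat template instead):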
truncated = sum(1 for sample in dataset if len(tokenizer.encode(sample)) > max_length)
print(f"Truncated samples: {truncated}/{len(dataset)} ({100*truncated/len(dataset):.1f}%)")
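
Beyond a single truncation rate, it can help to compare a few candidate limits before picking max_length. The sketch below is plain Python; `truncation_report` is a hypothetical helper and the token counts are invented stand-ins for the per-sample lengths you would get from your tokenizer.

```python
def truncation_report(token_lengths, candidates=(1024, 2048, 4096)):
    """Return {max_length: fraction of samples that would be truncated}."""
    total = len(token_lengths)
    return {
        limit: sum(1 for n in token_lengths if n > limit) / total
        for limit in candidates
    }

# Invented token counts standing in for len(tokenizer.encode(text)) per sample.
token_lengths = [300, 900, 1500, 1800, 2500, 3000, 5000, 700]

report = truncation_report(token_lengths)
for limit, frac in report.items():
    print(f"max_length={limit}: {frac:.1%} truncated")
```

If most of the truncated samples lose their final answer, raising the limit (or dropping the longest samples) is probably worth the extra memory.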

2. Label Shift Alignment

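# Hugging Face causal LM models shift labels by one position internally,
# so do NOT pre-shift them: labels should line up with input_ids token for token
# (with -100 on the positions you don't want to supervise).
input_ids = batch["input_ids"]
labels = batch["labels"]

# Shapes must match, and every supervised label must equal the input token at that position
assert labels.shape == input_ids.shape, "Shape mismatch!"
supervised = labels != -100
assert (labels[supervised] == input_ids[supervised]).all(), "Labels look pre-shifted or misaligned"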
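
As a rough illustration of what aligned labels look like, here is a sketch with plain Python lists and made-up token ids. `build_labels` and `shifted_pairs` are hypothetical helpers, not library functions: the first masks the prompt with -100, the second mimics the one-position shift that the causal LM loss applies for you.

```python
IGNORE_INDEX = -100  # label value that PyTorch cross-entropy ignores by default

def build_labels(input_ids, prompt_len):
    """Copy input_ids and mask the first prompt_len positions."""
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])

def shifted_pairs(input_ids, labels):
    """Pair the token seen at position t with the label at t+1,
    dropping masked targets (this is the shift the model does internally)."""
    return [
        (seen, target)
        for seen, target in zip(input_ids[:-1], labels[1:])
        if target != IGNORE_INDEX
    ]

# Made-up ids: 3 prompt tokens followed by 3 response tokens.
input_ids = [11, 12, 13, 21, 22, 23]
labels = build_labels(input_ids, prompt_len=3)

print(labels)
print(shifted_pairs(input_ids, labels))
```

The first supervised pair is (last prompt token, first response token), which is what you want. If you shift the labels yourself before passing them in, every target ends up one token off.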

3. Gradient Accumulation Gotcha

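# With gradient_accumulation_steps=4 and batch_size=1 your effective batch is 4.
# In principle that matches batch_size=4, but if the loss is averaged per micro-batch,
# short and long samples get equal weight and the updates can be noisier.
# Try: batch_size=2, accumulation_steps=2 (same effective batch, fewer micro-batches)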
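
To make the accumulation point concrete, here is a small sketch in plain Python. The per-token loss values are invented for illustration; it compares averaging each micro-batch's mean loss with averaging over all supervised tokens, which is what a single batch of 4 would compute.

```python
# Hypothetical per-token losses for four micro-batches of size 1:
# one short sample with a high loss, three longer samples with a low loss.
micro_batches = [
    [8.0, 8.0],      # 2 supervised tokens
    [1.0] * 10,
    [1.0] * 10,
    [1.0] * 10,
]

def mean(xs):
    return sum(xs) / len(xs)

# Naive accumulation: the mean of the per-micro-batch means.
mean_of_means = mean([mean(mb) for mb in micro_batches])

# One batch of 4: the mean over all supervised tokens.
all_tokens = [loss for mb in micro_batches for loss in mb]
token_weighted = mean(all_tokens)

print(f"mean of micro-batch means: {mean_of_means:.4f}")
print(f"token-weighted mean:       {token_weighted:.4f}")
```

The short, high-loss sample dominates the naive average. Whether your setup behaves this way depends on how your trainer version normalizes the loss under accumulation, so treat this as something to check rather than a given.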

Quick Diagnostic Script:

def diagnose_training_batch(batch, tokenizer):
    """Run this on a few batches before training"""
    labels = batch["labels"]
    input_ids = batch["input_ids"]

    print("=== BATCH DIAGNOSTICS ===")
    print(f"Batch size: {labels.shape[0]}")
    print(f"Sequence length: {labels.shape[1]}")

    # Check supervision ratio
    supervised_tokens = (labels != -100).sum().item()
    total_tokens = labels.numel()
    print(f"Supervision ratio: {supervised_tokens/total_tokens:.2%}")

    # Check for all -100 rows (no supervision!)
    no_supervision = (labels == -100).all(dim=1).sum().item()
    print(f"Samples with NO supervision: {no_supervision}")

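    # Check that token ids are valid (integer ids can't be NaN/Inf, so check the range instead)
    vocab_size = len(tokenizer)
    print(f"Min token id: {input_ids.min().item()}, max token id: {input_ids.max().item()}")
    print(f"Out-of-range ids: {((input_ids < 0) | (input_ids >= vocab_size)).any().item()}")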

    # Decode first supervised tokens
    for i in range(min(2, labels.shape[0])):
        kept = labels[i][labels[i] != -100]
        if kept.numel() > 0:
            print(f"\nSample {i} supervised text:")
            print(tokenizer.decode(kept[:50], skip_special_tokens=False))

Qwen3.5-Specific Fix:

Since Qwen3.5 has built-in thinking, you might need to:

# Option 1: Strip thinking from your training data
def strip_thinking(response):
    """Remove <think> tags if present"""
    import re
    return re.sub(r'<think>.*?</think>', '', response, flags=re.DOTALL)

# Option 2: Or embrace it - keep thinking but format correctly
def format_for_qwen_thinking(user_msg, reasoning, final_answer):
    return [
        {"role": "user", "content": user_msg},
        {"role": "assistant", "content": f"<think>{reasoning}</think>{final_answer}"}
    ]
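
If you go with Option 2, you first need the reasoning and the final answer as separate strings. Here is a minimal sketch, assuming your Claude/Gemini traces already mark reasoning with <think>...</think> tags; `split_thinking` is a hypothetical helper and you would need a different pattern for other markers.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", flags=re.DOTALL)

def split_thinking(response):
    """Return (reasoning, final_answer) from a response that may contain <think> blocks."""
    reasoning = "\n".join(part.strip() for part in THINK_RE.findall(response))
    final_answer = THINK_RE.sub("", response).strip()
    return reasoning, final_answer

sample = "<think>2 + 2 is 4.\nDouble-check: yes.</think>\nThe answer is 4."
reasoning, answer = split_thinking(sample)
print(repr(reasoning))
print(repr(answer))
```

A response with no <think> block comes back with an empty reasoning string, so you can route those samples to the answer-only format.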

Learning Rate Schedule Tweak:

# Instead of just warmup_ratio=0.03, try:
warmup_steps = 100  # Fixed steps instead of ratio (on a short run, 3% can be only a handful of steps)
lr_scheduler_type = "cosine_with_restarts"  # Alternative to linear decay; not guaranteed to be more stable
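
To see why a fixed warmup can matter, here is a plain-Python sketch of how a ratio turns into steps, plus the usual shape of linear warmup followed by one cosine decay. The run length and base learning rate are assumed values for illustration, and `lr_at` is a hypothetical helper that only approximates what a library scheduler does.

```python
import math

def warmup_steps_from_ratio(total_steps, warmup_ratio):
    """Number of warmup steps implied by a ratio (rounded up)."""
    return math.ceil(total_steps * warmup_ratio)

def lr_at(step, base_lr, warmup_steps, total_steps):
    """Linear warmup to base_lr, then a single cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total_steps = 500   # assumed run length, for illustration only
base_lr = 2e-5      # assumed peak learning rate

print("warmup steps at ratio 0.03:", warmup_steps_from_ratio(total_steps, 0.03))
for step in (0, 50, 100, 300, 500):
    print(step, f"{lr_at(step, base_lr, 100, total_steps):.2e}")
```

On a run this short, the ratio gives a much shorter warmup than 100 fixed steps, so the model reaches the peak learning rate far sooner.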

Emergency Stabilization:

If it’s still exploding:

# Nuclear options:
max_grad_norm = 0.1  # More aggressive clipping
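optim = "adamw_torch_fused"  # Fused AdamW: same algorithm as adamw_torch, mainly a speed-up
gradient_checkpointing = True  # Saves memory (room for longer sequences); does not stabilize the loss by itself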
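
For intuition on what max_grad_norm does, here is a plain-Python sketch of clipping by global norm. `clip_by_global_norm` is a hypothetical stand-in operating on a flat list of made-up gradient values; in a real run the trainer applies the equivalent operation to the model's parameters.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by one factor so their joint L2 norm is <= max_norm.
    Returns (clipped_grads, norm_before_clipping)."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

grads = [3.0, 4.0]  # made-up gradient values; joint norm is 5.0
for max_norm in (1.0, 0.1):
    clipped, before = clip_by_global_norm(grads, max_norm)
    after = math.sqrt(sum(g * g for g in clipped))
    print(f"max_grad_norm={max_norm}: norm {before:.1f} -> {after:.3f}")
```

Clipping keeps the gradient direction and only shrinks the step. At 0.1 nearly every step gets clipped, so expect slower progress; it is worth logging the pre-clip norm to see how often that happens.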

Want me to write a complete preprocessing pipeline that:

Loads your shuffled Claude/Gemini dataset
Normalizes formatting to Qwen’s expectations
Applies proper chat templates
Validates supervision before training?

Just say the word! And may your gradients be stable like π is constant!

Yes That would be perfect Do The Best You Can! Chat Qwen Ai!

You got it, Eric! Let’s build you a complete, battle-tested preprocessing pipeline tailored for your shuffled Claude + Gemini reasoning dataset, designed to stop the loss explosions, align with Qwen3.5’s “think-then-answer” template, and give you clean, stable supervision tokens.

We’ll do this in 4 phases:

PHASE 1: LOAD & INSPECT YOUR DATASET

import json
import pandas as pd
from datasets import load_dataset
from transformers import AutoTokenizer

# Load your dataset (adjust path or source as needed)
# Example: if it's a JSONL file on Hugging Face Hub or local
dataset = load_dataset("json", data_files="your_shuffled_reasoning_data.jsonl", split="train")

# OR if it's already a HF dataset:
# dataset = load_dataset("your_username/your_dataset_name")

print(f"Loaded {len(dataset)} samples")
print("Sample raw entry:")
print(dataset[0])

PHASE 2: NORMALIZE FORMAT TO QWEN’S EXPECTATIONS

Qwen3.5 expects:

User message → Assistant thinks → Assistant answers
But during fine-tuning, we usually want to supervise ONLY the final answer (or optionally include thinking if formatted correctly)

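We’ll assume you want to train on final answers only (safest for stability). Note that the function below keeps the last assistant message as-is, so if your data embeds reasoning in that message, strip it first. If you want to include reasoning, I’ll show you how later.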

def normalize_to_qwen_format(example):
    """
    Converts mixed Claude/Gemini outputs into Qwen-compatible chat format.
    Assumes each sample has 'messages' key with list of {role, content} dicts.
    We'll extract only the LAST assistant response as the target.
    """
    messages = example.get("messages", [])

    # Filter out system messages if any
    user_msgs = [m for m in messages if m["role"] == "user"]
    assistant_msgs = [m for m in messages if m["role"] == "assistant"]

    if not user_msgs or not assistant_msgs:
        return None  # Skip invalid samples

    # Take last user prompt and last assistant response
    last_user = user_msgs[-1]["content"]
    last_assistant = assistant_msgs[-1]["content"]

    # Build Qwen-style chat template input
    qwen_messages = [
        {"role": "user", "content": last_user},
        {"role": "assistant", "content": last_assistant}
    ]

    return {
        "chat_template_input": qwen_messages,
        "target_text": last_assistant  # For debugging/validation
    }

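# Drop invalid samples BEFORE mapping: map() does not remove rows when the
# function returns None, and filter(lambda x: x is not None) never matches
# because every row reaches the filter as a dict.
def has_user_and_assistant(example):
    roles = {m["role"] for m in (example.get("messages") or [])}
    return {"user", "assistant"} <= roles

valid_dataset = dataset.filter(has_user_and_assistant)
normalized_dataset = valid_dataset.map(normalize_to_qwen_format, remove_columns=dataset.column_names)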

print(f"After normalization: {len(normalized_dataset)} valid samples")
print("Normalized sample:")
print(normalized_dataset[0])

If you want to INCLUDE reasoning steps (e.g., “Let me think step by step…”), let me know. We can preserve them by wrapping them in <think>...</think> tags, as in the format_for_qwen_thinking example earlier, subject to what the chat template of your Qwen checkpoint expects.

PHASE 3: APPLY CHAT TEMPLATE + TOKENIZE WITH PROPER MASKING

This is where most explosions happen — improper labeling or tokenization.

model_name = "Qwen/Qwen2.5-7B-Instruct"  # Placeholder: use the exact checkpoint you are fine-tuning so the chat template matches
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Ensure special tokens are set correctly
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def tokenize_and_mask(example):
    messages = example["chat_template_input"]

    # Apply chat template — this adds all necessary special tokens
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=False  # We’re training, not generating
    )

    # Tokenize
    enc = tokenizer(
        text,
        truncation=True,
        max_length=1024,  # Adjust based on your VRAM
        padding=False,
        return_tensors=None
    )

    # Now create labels: mask everything except assistant response
    # Find where assistant response starts
    assistant_start_marker = "assistant\n"
    assistant_end_marker = "
