Is this a common/reasonable recipe for full finetuning Qwen3.5-4B?
I’m about to run a full FT on Qwen/Qwen3.5-4B for a PT-BR legal assistant dataset and wanted a sanity check before I burn a bunch of GPU time.
This is not LoRA, just straight full finetuning.
Setup right now:
model: Qwen/Qwen3.5-4B
data: chat dataset with a messages field
domain: Brazilian legal
max length: 1024
split: 95/5 random
epochs: 1
lr: 1e-5
wd: 0.1
warmup: 0.03
scheduler: cosine
batch size: 4
grad accum: 4
precision: bf16 if available, else fp16
grad checkpointing: on
packing: off
optimizer: adamw_torch_fused
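For anyone checking the numbers, here is a rough sketch of what these settings imply per optimizer step. The dataset size (10,000 samples) and the single-GPU default are assumptions for illustration only, and the helper name schedule_summary is made up; the Trainer's exact step count can differ slightly depending on rounding and the number of devices.

```python
import math


def schedule_summary(
    num_train_samples,
    per_device_batch=4,
    grad_accum=4,
    num_devices=1,      # assumption: single GPU
    epochs=1,
    warmup_ratio=0.03,
    max_length=1024,
):
    """Approximate the schedule implied by the hyperparameters in the post."""
    # Samples consumed per optimizer update.
    effective_batch = per_device_batch * grad_accum * num_devices
    # Approximate number of optimizer updates over the whole run.
    optimizer_steps = math.ceil(num_train_samples / effective_batch) * epochs
    # Warmup length as a fraction of all updates, rounded up.
    warmup_steps = math.ceil(optimizer_steps * warmup_ratio)
    # Upper bound only: with packing off, most samples are shorter than max_length.
    max_tokens_per_step = effective_batch * max_length
    return {
        "effective_batch": effective_batch,
        "optimizer_steps": optimizer_steps,
        "warmup_steps": warmup_steps,
        "max_tokens_per_step": max_tokens_per_step,
    }


if __name__ == "__main__":
    # 10_000 is a placeholder dataset size, not the real one.
    print(schedule_summary(10_000))
```

With the placeholder 10,000 samples this gives an effective batch of 16, about 625 optimizer steps, and about 19 warmup steps, so the warmup is short in absolute terms on a small dataset.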
What I’m doing is basically:
normalize messages
apply Qwen chat template
drop samples over max length
train with trl.SFTTrainer
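The preprocessing steps above can be sketched roughly as follows. The normalization rules (lowercasing roles, stripping whitespace, dropping empty turns) are assumptions, since the post does not say what "normalize" covers, and the function names are hypothetical. The tokenizer is passed in, so nothing is downloaded here; it only needs apply_chat_template and the usual call interface.

```python
import random

ALLOWED_ROLES = {"system", "user", "assistant"}


def normalize_messages(messages):
    """Keep role/content only, strip whitespace, drop empty turns (assumed rules)."""
    cleaned = []
    for m in messages:
        role = str(m.get("role", "")).strip().lower()
        content = str(m.get("content", "")).strip()
        if role in ALLOWED_ROLES and content:
            cleaned.append({"role": role, "content": content})
    return cleaned


def token_length(messages, tokenizer):
    """Render the chat template to text, then count tokens without adding extra specials."""
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    return len(tokenizer(text, add_special_tokens=False)["input_ids"])


def prepare(samples, tokenizer, max_length=1024):
    """Normalize every sample and drop the ones that exceed max_length tokens."""
    kept = []
    for sample in samples:
        messages = normalize_messages(sample["messages"])
        if messages and token_length(messages, tokenizer) <= max_length:
            kept.append({"messages": messages})
    return kept


def split_train_eval(samples, eval_fraction=0.05, seed=42):
    """Random split (95/5 by default) with a fixed seed for reproducibility."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, round(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]
```

Usage would be something like kept = prepare(raw_samples, tokenizer) followed by train, eval_ = split_train_eval(kept), where raw_samples is a list of dicts with a messages field. Measuring length on the templated text (not the raw content) matters, because the template's role markers also count toward the 1024 limit.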
Core training code is roughly:
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig
import torch

MODEL_NAME = "Qwen/Qwen3.5-4B"
MAX_LENGTH = 1024

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
# Fall back to EOS as the pad token if the tokenizer does not define one.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,
    dtype=torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16,
    low_cpu_mem_usage=True,
)

# Full finetuning: every parameter is trainable.
for p in model.parameters():
    p.requires_grad = True

# The KV cache is not useful during training with gradient checkpointing.
model.config.use_cache = False

args = SFTConfig(
    output_dir="output",
    num_train_epochs=1,
    learning_rate=1e-5,
    weight_decay=0.1,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    bf16=torch.cuda.is_bf16_supported(),
    fp16=not torch.cuda.is_bf16_supported(),
    tf32=True,
    gradient_checkpointing=True,
    packing=False,
    max_length=MAX_LENGTH,
    eval_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    report_to="none",
    remove_unused_columns=False,
    eos_token=tokenizer.eos_token,
    pad_token=tokenizer.pad_token,
)

# train_ds / eval_ds come from the preprocessing above (normalized, length-filtered, 95/5 split).
trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    processing_class=tokenizer,
)
trainer.train()
Main thing I’m trying to figure out is: is this a common/reasonable recipe, or am I missing some Qwen-specific gotcha?
Stuff I’m unsure about:
should I be using Qwen/Qwen3.5-4B-Base instead of the post-trained one?
for Qwen chat data, is messages + SFTTrainer enough, or is there some masking/template detail that matters a lot?
would you train on the whole formatted conversation, or only assistant tokens?
do any of these hparams look obviously off for domain adaptation?
any known Qwen3.5 full FT traps?
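To make the "whole conversation vs. only assistant tokens" question concrete, here is a toy sketch of the difference at the label level. It assumes the conversation has already been split into per-role token segments (how that is done depends on the chat template and is not shown), and the token ids are made up. The -100 value is the index that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(segments, assistant_only=True):
    """Build (input_ids, labels) from a list of (role, token_ids) segments.

    With assistant_only=True, every non-assistant token gets IGNORE_INDEX,
    so it contributes nothing to the loss. With assistant_only=False the
    labels equal the inputs, i.e. the loss covers the whole conversation.
    """
    input_ids, labels = [], []
    for role, ids in segments:
        input_ids.extend(ids)
        if assistant_only and role != "assistant":
            labels.extend([IGNORE_INDEX] * len(ids))
        else:
            labels.extend(ids)
    return input_ids, labels


if __name__ == "__main__":
    # Made-up token ids for one user turn and one assistant turn.
    toy = [("user", [11, 12, 13]), ("assistant", [21, 22])]
    print(build_labels(toy, assistant_only=True))
    print(build_labels(toy, assistant_only=False))
```

In the assistant-only case the model still sees the user and system tokens as context; they just do not produce gradient signal as targets.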
Not looking for the “best possible” setup, mostly just trying to make sure this is a normal/sane way to do it.
Anyone here already fine-tuned Qwen3.5 and can say whether this looks reasonable?