
Is this a common/reasonable recipe for full finetuning Qwen3.5-4B?

Hugging Face Forums [Unofficial] April 2, 2026

I’m about to run a full FT on Qwen/Qwen3.5-4B for a PT-BR legal assistant dataset and wanted a sanity check before I burn a bunch of GPU time.

This is not LoRA, just straight full finetuning.

Setup right now:

  • model: Qwen/Qwen3.5-4B

  • data: chat dataset with a messages field

  • domain: Brazilian legal

  • max length: 1024

  • split: 95/5 random

  • epochs: 1

  • lr: 1e-5

  • wd: 0.1

  • warmup ratio: 0.03

  • scheduler: cosine

  • batch size: 4

  • grad accum: 4

  • precision: bf16 if available, else fp16

  • grad checkpointing: on

  • packing: off

  • optimizer: adamw_torch_fused
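
For a rough sense of scale, the settings above translate into an effective batch size and a step budget roughly as sketched below. The dataset size is not stated in this post, so `num_samples` and `num_gpus` are hypothetical inputs, and `step_budget` / `lr_at` are illustrative helpers rather than anything from transformers or TRL. The real Trainer may round step counts slightly differently.

```python
import math

def step_budget(num_samples, per_device_bs=4, grad_accum=4, num_gpus=1,
                epochs=1, warmup_ratio=0.03):
    """Approximate optimizer-step budget for the hyperparameters listed above."""
    effective_bs = per_device_bs * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_samples / effective_bs)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_bs, total_steps, warmup_steps

def lr_at(step, total_steps, warmup_steps, peak_lr=1e-5):
    """Linear warmup to peak_lr, then a half-cosine decay towards zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

if __name__ == "__main__":
    # Hypothetical: 19,000 training samples on a single GPU.
    bs, total, warmup = step_budget(19_000)
    print(bs, total, warmup)
    print(lr_at(warmup, total, warmup))
```

With one GPU the effective batch size is 4 x 4 = 16 sequences per optimizer step, so a small dataset gives only a few hundred to a few thousand updates in one epoch.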

What I’m doing is basically:

  • normalize messages

  • apply Qwen chat template

  • drop samples over max length

  • train with trl.SFTTrainer
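
The first three steps above can be sketched as follows. This is a minimal sketch, not the actual preprocessing script: what "normalize" means here (lowercasing roles, stripping whitespace, dropping empty turns) is an assumption, and `normalize_messages`, `templated_length` and `preprocess` are hypothetical helper names. The `tokenizer` argument is expected to be a Hugging Face tokenizer with a chat template.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}

def normalize_messages(messages):
    """Keep only role/content, lowercase roles, strip text, drop empty turns."""
    cleaned = []
    for m in messages:
        role = str(m.get("role") or "").strip().lower()
        content = str(m.get("content") or "").strip()
        if role in ALLOWED_ROLES and content:
            cleaned.append({"role": role, "content": content})
    return cleaned

def templated_length(messages, tokenizer):
    """Token count of the conversation after the chat template is applied."""
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    # The template already inserts special tokens, so do not add them again.
    return len(tokenizer(text, add_special_tokens=False)["input_ids"])

def preprocess(examples, tokenizer, max_length=1024):
    """Normalize each example and drop (not truncate) those over max_length."""
    kept = []
    for ex in examples:
        msgs = normalize_messages(ex["messages"])
        if msgs and templated_length(msgs, tokenizer) <= max_length:
            kept.append({"messages": msgs})
    return kept
```

Dropping long samples instead of truncating them avoids training on conversations whose final assistant turn has been cut off, at the cost of losing some data.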

Core training code is roughly:

from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig
import torch

MODEL_NAME = "Qwen/Qwen3.5-4B"
MAX_LENGTH = 1024

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,
    dtype=torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16,
    low_cpu_mem_usage=True,
)

for p in model.parameters():
    p.requires_grad = True

model.config.use_cache = False

args = SFTConfig(
    output_dir="output",
    num_train_epochs=1,
    learning_rate=1e-5,
    weight_decay=0.1,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    bf16=torch.cuda.is_bf16_supported(),
    fp16=not torch.cuda.is_bf16_supported(),
    tf32=True,
    gradient_checkpointing=True,
    packing=False,
    max_length=MAX_LENGTH,
    eval_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    report_to="none",
    remove_unused_columns=False,
    eos_token=tokenizer.eos_token,
    pad_token=tokenizer.pad_token,
)

# train_ds / eval_ds: the 95/5 random split described above, each with a "messages" column
trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    processing_class=tokenizer,
)
trainer.train()
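
One detail in the snippet above that may be worth double-checking is the fp16 fallback path: there the weights are loaded in float16 and `fp16=True` is also set. As far as I understand, fp16 mixed precision goes through PyTorch's gradient scaler, which expects float32 weights and otherwise fails with "Attempting to unscale FP16 gradients". Likewise, `tf32=True` appears to be rejected on GPUs older than Ampere. A minimal sketch of a selection helper under those assumptions (`pick_precision` is a hypothetical name, and it returns dtype names as strings so it runs without torch):

```python
def pick_precision(bf16_supported: bool) -> dict:
    """Choose load dtype and trainer flags so they do not contradict each other."""
    if bf16_supported:
        # bf16 weights + bf16 training; TF32 assumed available on the same hardware.
        return {"load_dtype": "bfloat16", "bf16": True, "fp16": False, "tf32": True}
    # Assumption: fp16 AMP wants float32 weights; autocast does the fp16 compute.
    return {"load_dtype": "float32", "bf16": False, "fp16": True, "tf32": False}

if __name__ == "__main__":
    cfg = pick_precision(bf16_supported=False)
    print(cfg)
```

In the training script this would be used as `dtype=getattr(torch, cfg["load_dtype"])` when loading the model, with the remaining keys passed to `SFTConfig`. Note that float32 weights roughly double the memory needed for the model itself.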

Main thing I’m trying to figure out is: is this a common/reasonable recipe, or am I missing some Qwen-specific gotcha?

Stuff I’m unsure about:

  • should I be using Qwen/Qwen3.5-4B-Base instead of the post-trained one?

  • for Qwen chat data, is messages + SFTTrainer enough, or is there some masking/template detail that matters a lot?

  • would you train on the whole formatted conversation, or only assistant tokens?

  • do any of these hparams look obviously off for domain adaptation?

  • any known Qwen3.5 full FT traps?
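
To make the "whole conversation vs. only assistant tokens" question concrete: training only on assistant tokens usually means setting the labels of every other token to -100, the index that PyTorch's cross-entropy loss ignores by default. The sketch below shows just that masking step on plain lists. `mask_non_assistant` and `supervised_fraction` are hypothetical helpers, and how the per-token `assistant_mask` is obtained (for example from the chat template) is assumed, not shown. Recent TRL releases also expose an `assistant_only_loss` option on `SFTConfig`. Whether the Qwen3.5 chat template works with it is something I have not verified.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_non_assistant(input_ids, assistant_mask):
    """Return labels that supervise only tokens flagged as assistant output."""
    if len(input_ids) != len(assistant_mask):
        raise ValueError("input_ids and assistant_mask must have the same length")
    return [tok if keep else IGNORE_INDEX
            for tok, keep in zip(input_ids, assistant_mask)]

def supervised_fraction(labels):
    """Share of tokens that actually contribute to the loss."""
    if not labels:
        return 0.0
    return sum(1 for t in labels if t != IGNORE_INDEX) / len(labels)

if __name__ == "__main__":
    # Toy ids: the first three tokens are the prompt, the last two the answer.
    labels = mask_non_assistant([101, 7, 8, 42, 43], [0, 0, 0, 1, 1])
    print(labels, supervised_fraction(labels))
```

Training on the whole formatted conversation corresponds to an all-ones mask, so the model is also optimized to reproduce user and system turns.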

Not looking for the “best possible” setup, mostly just trying to make sure this is a normal/sane way to do it.

Anyone here already fine-tuned Qwen3.5 and can say whether this looks reasonable?
