{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreifssddtqc7v6fynwya5dt2jor5euk7wmacyarucjia5m4l5ftvmgy",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mii47wvzzwx2"
},
"path": "/t/is-this-a-common-reasonable-recipe-for-full-finetuning-qwen3-5-4b/174873#post_1",
"publishedAt": "2026-04-02T01:36:28.000Z",
"site": "https://discuss.huggingface.co",
"textContent": "I’m about to run a **full FT** on **Qwen/Qwen3.5-4B** for a **PT-BR legal assistant** dataset and wanted a sanity check before I burn a bunch of GPU time.\n\nThis is **not LoRA** , just straight full finetuning.\n\nSetup right now:\n\n * model: `Qwen/Qwen3.5-4B`\n\n * data: chat dataset with a `messages` field\n\n * domain: Brazilian legal\n\n * max length: 1024\n\n * split: 95/5 random\n\n * epochs: 1\n\n * lr: `1e-5`\n\n * wd: `0.1`\n\n * warmup: `0.03`\n\n * scheduler: cosine\n\n * batch size: 4\n\n * grad accum: 4\n\n * precision: bf16 if available, else fp16\n\n * grad checkpointing: on\n\n * packing: off\n\n * optimizer: `adamw_torch_fused`\n\n\n\n\nWhat I’m doing is basically:\n\n * normalize `messages`\n\n * apply Qwen chat template\n\n * drop samples over max length\n\n * train with `trl.SFTTrainer`\n\n\n\n\nCore training code is roughly:\n\n\n from transformers import AutoModelForCausalLM, AutoTokenizer\n from trl import SFTTrainer, SFTConfig\n import torch\n\n MODEL_NAME = \"Qwen/Qwen3.5-4B\"\n MAX_LENGTH = 1024\n\n tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)\n if tokenizer.pad_token is None:\n tokenizer.pad_token = tokenizer.eos_token\n tokenizer.padding_side = \"right\"\n\n model = AutoModelForCausalLM.from_pretrained(\n MODEL_NAME,\n trust_remote_code=True,\n dtype=torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16,\n low_cpu_mem_usage=True,\n )\n\n for p in model.parameters():\n p.requires_grad = True\n\n model.config.use_cache = False\n\n args = SFTConfig(\n output_dir=\"output\",\n num_train_epochs=1,\n learning_rate=1e-5,\n weight_decay=0.1,\n warmup_ratio=0.03,\n lr_scheduler_type=\"cosine\",\n per_device_train_batch_size=4,\n per_device_eval_batch_size=4,\n gradient_accumulation_steps=4,\n bf16=torch.cuda.is_bf16_supported(),\n fp16=not torch.cuda.is_bf16_supported(),\n tf32=True,\n gradient_checkpointing=True,\n packing=False,\n max_length=MAX_LENGTH,\n eval_strategy=\"steps\",\n eval_steps=100,\n save_strategy=\"steps\",\n save_steps=100,\n report_to=\"none\",\n remove_unused_columns=False,\n eos_token=tokenizer.eos_token,\n pad_token=tokenizer.pad_token,\n )\n\n trainer = SFTTrainer(\n model=model,\n args=args,\n train_dataset=train_ds,\n eval_dataset=eval_ds,\n processing_class=tokenizer,\n )\n trainer.train()\n\n\nMain thing I’m trying to figure out is: **is this a common/reasonable recipe** , or am I missing some Qwen-specific gotcha?\n\nStuff I’m unsure about:\n\n * should I be using `Qwen/Qwen3.5-4B-Base` instead of the post-trained one?\n\n * for Qwen chat data, is `messages` + `SFTTrainer` enough, or is there some masking/template detail that matters a lot?\n\n * would you train on the whole formatted conversation, or only assistant tokens?\n\n * do any of these hparams look obviously off for domain adaptation?\n\n * any known Qwen3.5 full FT traps?\n\n\n\n\nNot looking for the “best possible” setup, mostly just trying to make sure this is a normal/sane way to do it.\n\nAnyone here already fine-tuned Qwen3.5 and can say whether this looks reasonable?",
"title": "Is this a common/reasonable recipe for full finetuning Qwen3.5-4B?"
}