Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreifssddtqc7v6fynwya5dt2jor5euk7wmacyarucjia5m4l5ftvmgy",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3miicx3xf2jz2"
  },
  "path": "/t/is-this-a-common-reasonable-recipe-for-full-finetuning-qwen3-5-4b/174873#post_1",
  "publishedAt": "2026-04-02T01:36:28.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "I’m about to run a **full FT** on **Qwen/Qwen3.5-4B** for a **PT-BR legal assistant** dataset and wanted a sanity check before I burn a bunch of GPU time.\n\nThis is **not LoRA** , just straight full finetuning.\n\nSetup right now:\n\n  * model: `Qwen/Qwen3.5-4B`\n\n  * data: chat dataset with a `messages` field\n\n  * domain: Brazilian legal\n\n  * max length: 1024\n\n  * split: 95/5 random\n\n  * epochs: 1\n\n  * lr: `1e-5`\n\n  * wd: `0.1`\n\n  * warmup: `0.03`\n\n  * scheduler: cosine\n\n  * batch size: 4\n\n  * grad accum: 4\n\n  * precision: bf16 if available, else fp16\n\n  * grad checkpointing: on\n\n  * packing: off\n\n  * optimizer: `adamw_torch_fused`\n\n\n\n\nWhat I’m doing is basically:\n\n  * normalize `messages`\n\n  * apply Qwen chat template\n\n  * drop samples over max length\n\n  * train with `trl.SFTTrainer`\n\n\n\n\nCore training code is roughly:\n\n\n    from transformers import AutoModelForCausalLM, AutoTokenizer\n    from trl import SFTTrainer, SFTConfig\n    import torch\n\n    MODEL_NAME = \"Qwen/Qwen3.5-4B\"\n    MAX_LENGTH = 1024\n\n    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)\n    if tokenizer.pad_token is None:\n        tokenizer.pad_token = tokenizer.eos_token\n    tokenizer.padding_side = \"right\"\n\n    model = AutoModelForCausalLM.from_pretrained(\n        MODEL_NAME,\n        trust_remote_code=True,\n        dtype=torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16,\n        low_cpu_mem_usage=True,\n    )\n\n    for p in model.parameters():\n        p.requires_grad = True\n\n    model.config.use_cache = False\n\n    args = SFTConfig(\n        output_dir=\"output\",\n        num_train_epochs=1,\n        learning_rate=1e-5,\n        weight_decay=0.1,\n        warmup_ratio=0.03,\n        lr_scheduler_type=\"cosine\",\n        per_device_train_batch_size=4,\n        per_device_eval_batch_size=4,\n        gradient_accumulation_steps=4,\n        bf16=torch.cuda.is_bf16_supported(),\n        fp16=not torch.cuda.is_bf16_supported(),\n        tf32=True,\n        gradient_checkpointing=True,\n        packing=False,\n        max_length=MAX_LENGTH,\n        eval_strategy=\"steps\",\n        eval_steps=100,\n        save_strategy=\"steps\",\n        save_steps=100,\n        report_to=\"none\",\n        remove_unused_columns=False,\n        eos_token=tokenizer.eos_token,\n        pad_token=tokenizer.pad_token,\n    )\n\n    trainer = SFTTrainer(\n        model=model,\n        args=args,\n        train_dataset=train_ds,\n        eval_dataset=eval_ds,\n        processing_class=tokenizer,\n    )\n    trainer.train()\n\n\nMain thing I’m trying to figure out is: **is this a common/reasonable recipe** , or am I missing some Qwen-specific gotcha?\n\nStuff I’m unsure about:\n\n  * should I be using `Qwen/Qwen3.5-4B-Base` instead of the post-trained one?\n\n  * for Qwen chat data, is `messages` + `SFTTrainer` enough, or is there some masking/template detail that matters a lot?\n\n  * would you train on the whole formatted conversation, or only assistant tokens?\n\n  * do any of these hparams look obviously off for domain adaptation?\n\n  * any known Qwen3.5 full FT traps?\n\n\n\n\nNot looking for the “best possible” setup, mostly just trying to make sure this is a normal/sane way to do it.\n\nAnyone here already fine-tuned Qwen3.5 and can say whether this looks reasonable?",
  "title": "Is this a common/reasonable recipe for full finetuning Qwen3.5-4B?"
}