Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreih6rhvm3cd7flvfqtjqsrwfpfilqft7kqjswg6pxvl2rkyu3pgitu",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3miijnuqhchk2"
  },
  "path": "/t/is-this-a-common-reasonable-recipe-for-full-finetuning-qwen3-5-4b/174873#post_2",
  "publishedAt": "2026-04-02T02:29:16.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "Recommendations from Chat Qwen Ai.\n\nThere are a few critical things to address here before you start burning GPU hours. The biggest one is a potential showstopper regarding the model name, followed by a crucial detail about loss masking that often gets overlooked with `SFTTrainer`.\n\nHere is the sanity check breakdown:\n\n### 1. The Model Name (Critical)\n\n**`Qwen/Qwen3.5-4B` does not exist publicly.**\n\nAs of late 2024, the latest public release is the **Qwen2.5** series. The available sizes are **0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B**. There is no 4B variant, and there is no 3.5 version.\n\n* **Action:** Double-check your model string. You likely mean `Qwen/Qwen2.5-3B-Instruct` or `Qwen/Qwen2.5-7B-Instruct`.\n\n* **Risk:** If you proceed with a non-existent model path, your script will fail immediately. If you are using a private/leaked checkpoint, ignore this, but assume the architecture matches Qwen2.5 for the advice below.\n\n### 2. Base vs. Instruct\n\n**Recommendation: Use the `-Instruct` version.**\n\n* **Why:** You are building a “Legal Assistant.” You want the model to retain its general chat capabilities, safety alignments, and instruction-following structure while adapting to the legal domain.\n\n* **Base Model:** If you use the Base model, you are teaching it *how* to chat and *how* to be a lawyer simultaneously. With only 1 epoch and a chat dataset, you risk the model not learning the chat format well enough, resulting in outputs that don’t stop correctly or ignore system prompts.\n\n* **Instruct Model:** It already knows how to chat. You are just shifting its knowledge distribution to Brazilian Law. This is much more sample-efficient.\n\n### 3. Loss Masking (The Biggest Technical Risk)\n\n**`SFTTrainer` does not automatically mask user tokens.**\n\nBy default, a Causal Language Model calculates loss on **every token** in the sequence. If you feed it a conversation:\n\n`[User]: What is the law? 
[Assistant]: The law is…`\n\nThe model will try to predict the User’s prompt as well as the Assistant’s response. This is noise and degrades performance.\n\n**You must mask the User/System tokens (set labels to -100) so loss is only calculated on Assistant tokens.**\n\n`SFTTrainer` with a `messages` column will apply the chat template, but by default it does **not** mask the User/System tokens for you.\n\n**Fix:** Preprocess your dataset to generate `input_ids` and `labels` manually before passing to the trainer, or use a custom data collator. Here is the skeleton of such a preprocessing function for Qwen (the masking step itself is left as a stub):\n\n```python\n\ndef preprocess(examples):\n\n    chat_batch = examples[\"messages\"]\n\n    # Apply template to get the full text\n\n    texts = [tokenizer.apply_chat_template(msg, tokenize=False, add_generation_prompt=False) for msg in chat_batch]\n\n    # Tokenize\n\n    tokenized = tokenizer(texts, truncation=True, max_length=MAX_LENGTH, padding=False)\n\n    # Now create labels with masking\n\n    labels = []\n\n    for i, msg in enumerate(chat_batch):\n\n        # Re-tokenize specifically to find where assistant turns start\n\n        # This is a simplified logic; robust masking requires tracking token offsets\n\n        # A simpler way with TRL is to use a custom collator,\n\n        # BUT for full FT safety, explicit masking is best.\n\n        pass\n\n    # ... (See note below on easier way)\n\n```\n\n**Easier Way with TRL:**\n\nIf you are on the latest `trl`, you can rely on the tokenizer, but you **must** ensure you are not training on user input. 
If you don’t want to write a complex masking function, a common workaround is to format your dataset so that the `text` field contains *only* the Assistant response, preceded by the context in a form the model accepts. Otherwise, accept that `SFTTrainer` with `messages` might train on every token, depending on the version.\n\n**Best Practice for Qwen:**\n\nUse a custom data collator or `map` function that sets `labels` to `-100` for all tokens except those belonging to the `assistant` role. If you skip this, your legal assistant might start hallucinating user queries.\n\n### 4. Hyperparameter Sanity Check\n\nYour hyperparameters are generally **very sane** for Full FT, with one caveat on sequence length.\n\n* **LR (`1e-5`):** Perfect for Full FT. (LoRA would use `2e-4` or higher, but Full FT needs to be more conservative to avoid catastrophic forgetting.)\n\n* **Epochs (`1`):** Correct. Full FT on instruct models should usually be 1 epoch to preserve general capabilities.\n\n* **Weight Decay (`0.1`):** Standard.\n\n* **Batch Size:** Effective batch size of 64 (4 * 4 * 4) is good.\n\n* **Max Length (`1024`):** **Too short for Legal.**\n\n    * Legal texts are long. Contracts, rulings, and statutes often exceed 1024 tokens.\n\n    * If you truncate at 1024, you might cut off the actual legal reasoning or the answer.\n\n    * **Recommendation:** If VRAM allows, push this to **2048** or **4096**. Qwen2.5 supports up to 32k natively. If you must stick to 1024 for the sanity check, be aware your real deployment will need longer context.\n\n* **Precision:** `bf16` is mandatory. Do not use `fp16` if you have Ampere (A100/A10) or Hopper (H100) GPUs. `fp16` can lead to loss spikes on larger vocabularies.\n\n### 5. Qwen-Specific Gotchas\n\n1. 
**BOS Token:** Qwen tokenizers sometimes behave oddly with the Beginning of Sequence (BOS) token when using `apply_chat_template`.\n\n    * Ensure `tokenizer.apply_chat_template(..., add_generation_prompt=False)` is used for training samples.\n\n    * Check that your tokenized input starts with the correct special tokens (usually `<|im_start|>` for Qwen2.5).\n\n2. **`trust_remote_code=True`:** You have this, which is harmless. Qwen2.5 is supported natively by recent `transformers` releases, so the flag is not strictly required; it matters only for checkpoints that ship their own config and modeling code.\n\n3. **Optimizer:** `adamw_torch_fused` is excellent. It saves memory and is faster. Ensure your PyTorch version supports it (2.0+).\n\n4. **Gradient Checkpointing:** You have this on. Essential for Full FT to save VRAM.\n\n### 6. Revised Code Snippet (Key Fixes)\n\nHere is the adjusted setup focusing on the **masking** and **model name** issues.\n\n```python\n\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n\nfrom trl import SFTTrainer, SFTConfig\n\nimport torch\n\nimport datasets\n\n# 1. VERIFY MODEL NAME\n\n# Assuming Qwen2.5-3B-Instruct as 3.5-4B doesn't exist publicly\n\nMODEL_NAME = \"Qwen/Qwen2.5-3B-Instruct\"\n\nMAX_LENGTH = 2048 # Increased for legal context\n\ntokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)\n\nif tokenizer.pad_token is None:\n\n    tokenizer.pad_token = tokenizer.eos_token\n\ntokenizer.padding_side = \"right\" # Right-padding for training (left-padding is for batched generation)\n\nmodel = AutoModelForCausalLM.from_pretrained(\n\n    MODEL_NAME,\n\n    trust_remote_code=True,\n\n    torch_dtype=torch.bfloat16, # Force bf16\n\n    low_cpu_mem_usage=True,\n\n)\n\n# The KV cache is not useful during training and conflicts with gradient checkpointing\n\nmodel.config.use_cache = False\n\n# 2. 
PREPROCESSING FOR MASKING\n\n# SFTTrainer with a 'messages' column does NOT automatically mask user tokens in all versions.\n\n# It is safer to preprocess into input_ids/labels with -100 masking.\n\n# Assumption: the chat template is ChatML-style, as in Qwen2.5, so every assistant turn\n\n# is wrapped as <|im_start|>assistant ... <|im_end|>. Check this against your tokenizer.\n\nASSISTANT_START = \"<|im_start|>assistant\\n\"\n\nTURN_END = \"<|im_end|>\"\n\ndef format_dataset(examples):\n\n    messages = examples[\"messages\"]\n\n    # Apply template to get full text\n\n    texts = [tokenizer.apply_chat_template(msg, tokenize=False, add_generation_prompt=False) for msg in messages]\n\n    # Tokenize and keep character offsets (requires a fast tokenizer)\n\n    tokenized = tokenizer(texts, truncation=True, max_length=MAX_LENGTH, padding=False, return_offsets_mapping=True)\n\n    offset_batch = tokenized.pop(\"offset_mapping\")\n\n    labels = []\n\n    for text, input_ids, offsets in zip(texts, tokenized[\"input_ids\"], offset_batch):\n\n        # Collect the character span of each assistant reply, including its closing <|im_end|>\n\n        spans = []\n\n        start = text.find(ASSISTANT_START)\n\n        while start != -1:\n\n            begin = start + len(ASSISTANT_START)\n\n            end = text.find(TURN_END, begin)\n\n            end = len(text) if end == -1 else end + len(TURN_END)\n\n            spans.append((begin, end))\n\n            start = text.find(ASSISTANT_START, end)\n\n        # Keep the label only for tokens that start inside an assistant span; mask the rest with -100\n\n        labels.append([\n\n            tok if any(b <= s < e for b, e in spans) else -100\n\n            for tok, (s, _) in zip(input_ids, offsets)\n\n        ])\n\n    tokenized[\"labels\"] = labels\n\n    return tokenized\n\n# Apply mapping\n\ntrain_ds = train_ds.map(format_dataset, batched=True, remove_columns=[\"messages\"])\n\neval_ds = eval_ds.map(format_dataset, batched=True, remove_columns=[\"messages\"])\n\nargs = SFTConfig(\n\n    output_dir=\"output\",\n\n    num_train_epochs=1,\n\n    learning_rate=1e-5,\n\n    weight_decay=0.1,\n\n    warmup_ratio=0.03,\n\n    lr_scheduler_type=\"cosine\",\n\n    per_device_train_batch_size=4,\n\n    gradient_accumulation_steps=4,\n\n    bf16=True, # Explicitly set\n\n    tf32=True,\n\n    gradient_checkpointing=True,\n\n    max_length=MAX_LENGTH,\n\n    eval_strategy=\"steps\",\n\n    
eval_steps=100,\n\n    save_strategy=\"steps\",\n\n    save_steps=100,\n\n    # Important: We preprocessed, so we don't need SFTTrainer to do it\n\n    dataset_text_field=None,\n\n    remove_unused_columns=True, # Now we can remove unused\n\n)\n\n# Assumption: an explicit collator is passed so the precomputed labels are padded with -100\n\n# and kept as-is; default label handling varies across TRL versions.\n\nfrom transformers import DataCollatorForSeq2Seq\n\ntrainer = SFTTrainer(\n\n    model=model,\n\n    args=args,\n\n    train_dataset=train_ds,\n\n    eval_dataset=eval_ds,\n\n    processing_class=tokenizer, # named `tokenizer=` in older TRL releases\n\n    data_collator=DataCollatorForSeq2Seq(tokenizer, label_pad_token_id=-100),\n\n)\n\n```\n\n### Summary of Recommendations\n\n1. **Fix Model Name:** Confirm you are using `Qwen2.5-3B-Instruct` (or 7B).\n\n2. **Implement Masking:** Do not train on User tokens. If you can’t implement the offset masking logic easily, consider using a library like **Unsloth**, which handles Qwen fine-tuning + masking very efficiently, or write the custom map function to set `labels` to `-100` for non-assistant turns.\n\n3. **Increase Context:** 1024 is likely too short for legal documents. Try 2048.\n\n4. **Keep Hyperparams:** Your LR, WD, and Scheduler choices are solid for Full FT.\n\nIf you fix the model name and ensure user-token masking, this is a very standard and reasonable recipe for domain adaptation.",
  "title": "Is this a common/reasonable recipe for full finetuning Qwen3.5-4B?"
}