{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreif4kdtbtfank3bc2h3voyka5j4ilovjocttcpvaq7ecdn5p63prka",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mmq3snu5n4e2"
},
"path": "/t/vlm-fine-tuning-near-zero-training-loss-but-poor-inference-accuracy-on-train-set-gemma-4-e2b-it/176224#post_2",
"publishedAt": "2026-05-26T02:16:23.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"TRL SFTTrainer docs",
"TRL issue #3751 — VLM SFT example computes loss for the entire sequence, including prompt/user content",
"TRL issue #5471 — assistant_only_loss=True requires {% generation %} / {% endgeneration %} markers",
"TRL issue #3781 — assistant_only_loss=True silently ignored with use_liger_kernel=True",
"HF Forum — SFTTrainer loss function and formatting_func",
"HF Forum — SFTTrainer works but without result",
"TRL SFTTrainer docs — VLM support",
"TRL VLM full-sequence loss issue",
"HF Forum — assistant_only_loss=True and VLM/processor path confusion",
"Transformers docs — Chat templates",
"Transformers docs — add_generation_prompt vs continue_final_message",
"Unsloth Gemma 4 Fine-tuning Guide",
"Unsloth chat templates docs",
"Google Gemma vision fine-tuning with Hugging Face",
"TRL SFTTrainer docs — PEFT integration",
"Google Gemma QLoRA guide",
"Medium — Fine-tuning Gemma 4 E2B step-by-step with Unsloth",
"HF Forum — VLM structured JSON/domain fine-tuning discussion",
"AWS — Fine-tune VLMs for multipage document-to-JSON",
"AWS sample repo — multimodal document-to-JSON with SageMaker AI",
"Transformers chat templates — generation prompts and continuing final messages",
"Unsloth vision fine-tuning docs",
"Google Gemma image understanding docs",
"AWS blog — Fine-tune VLMs for multipage document-to-JSON with SageMaker AI and SWIFT",
"AWS sample repo — sample-for-multi-modal-document-to-json-with-sagemaker-ai",
"Hugging Face Cookbook — Fine-Tuning a Vision Language Model Qwen2-VL-7B with TRL",
"Hugging Face Cookbook — Fine-tuning SmolVLM with TRL",
"Phil Schmid — Fine-tune multimodal LLMs / VLMs with TRL",
"Daniel van Strien — Fine-tuning VLMs for Art History with TRL and HF Jobs",
"AMD ROCm tutorial — Fine-tuning Qwen2-VL-7B on ChartQA with LoRA",
"huggingface/trl#3751",
"TRL SFTTrainer",
"Transformers chat templates"
],
"textContent": "For now, there seem to be reports of similar cases:\n\n* * *\n\nI would debug this as a **training/inference contract mismatch** before treating it as a Gemma 4 E2B-it capability problem.\n\nThe combination of:\n\n * near-zero training loss,\n * poor inference even on training images,\n * strong prediction skew toward one class,\n * a very short target answer such as `safe` / `unsafe` or JSON,\n * VLM SFT through a high-level wrapper/UI,\n\n\n\nis exactly the kind of pattern where the scalar loss may be telling you the model learned _some tokens_ , but not necessarily the task decision you care about.\n\nMy current guess would be:\n\n> The effective training target, effective inference prompt, and evaluation parser are probably not the same task contract.\n\nThe most likely causes, in order:\n\nPriority | Failure mode | Why it fits this symptom\n---|---|---\n1 | **Loss mask is wrong** : loss is computed on the full rendered conversation, not only the assistant label/JSON | Long prompt/template tokens can drive loss near zero while the few `safe`/`unsafe` tokens remain poorly learned\n2 | **Training and inference chat templates differ** | VLM/chat models are sensitive to role markers, image placeholders, EOS, and assistant-start tokens\n3 | **LoRA adapter/checkpoint/export is not actually used at inference** | Training loss can be real, while inference accidentally uses base model behavior\n4 | **Evaluation/parsing bug** | Parse failures or prompt-echoes can be misread as `unsafe`, creating artificial class skew\n5 | **Image/bbox/crop issue** | Possible, but I would check this only after the tiny-overfit and masking tests pass\n6 | **Gemma 4 / Unsloth / Transformers / TRL version issue** | Possible, but less useful to assume before inspecting the actual batch labels and rendered prompt\n\n## Why I would not trust the near-zero loss yet\n\nFor this task, the assistant answer is tiny:\n\n\n [{\"id\":\"0\",\"label\":\"unsafe\"}]\n\n\nBut the rendered 
training sequence may contain:\n\n * system prompt,\n * user instruction,\n * image placeholder tokens,\n * bbox text,\n * formatting/control tokens,\n * assistant JSON answer.\n\n\n\nIf the trainer computes loss on the whole rendered sequence, the model can reduce loss mostly by learning deterministic prompt/template tokens. The actual classification decision may be only a few tokens out of the whole sequence.\n\nThis is a known issue class in TRL/VLM SFT:\n\n * TRL SFTTrainer docs\n * TRL issue #3751 — VLM SFT example computes loss for the entire sequence, including prompt/user content\n * TRL issue #5471 — assistant_only_loss=True requires {% generation %} / {% endgeneration %} markers\n * TRL issue #3781 — assistant_only_loss=True silently ignored with use_liger_kernel=True\n * HF Forum — SFTTrainer loss function and formatting_func\n * HF Forum — SFTTrainer works but without result\n\n\n\nThe key TRL doc detail is that `completion_only_loss` and `assistant_only_loss` are separate from ordinary full-sequence language-modeling loss. For prompt-completion datasets, completion-only loss can supervise only the completion. 
For conversational assistant-only training, the chat template must be able to return assistant/generation masks.\n\nSo the first question is not “why is loss low?” but:\n\n> Which tokens actually have labels other than `-100`?\n\n## Check 1: inspect the real supervised tokens\n\nThis is the single most important test.\n\n\n batch = next(iter(trainer.get_train_dataloader()))\n\n input_ids = batch[\"input_ids\"][0]\n labels = batch[\"labels\"][0]\n\n mask = labels != -100\n\n print(\"input length:\", input_ids.numel())\n print(\"supervised token count:\", mask.sum().item())\n print(tokenizer.decode(input_ids[mask], skip_special_tokens=False))\n\n\nExpected output should be close to only the assistant target:\n\n\n [{\"id\":\"0\",\"label\":\"unsafe\"}]\n\n\nBad output:\n\n\n You are a safety vision model...\n Inspect the stepladder...\n Ladder bbox: ...\n [{\"id\":\"0\",\"label\":\"unsafe\"}]\n\n\nVery bad output:\n\n\n <image> <pad> <eos>\n\n\nor almost no supervised tokens.\n\nInterpretation:\n\nWhat `labels != -100` decodes to | Interpretation\n---|---\nOnly assistant JSON / only `safe` or `unsafe` | Loss target is probably OK\nSystem/user prompt + assistant answer | Loss is probably diluted by prompt/template tokens\nImage/pad/special tokens | Collator/token masking is likely wrong\nEmpty or almost empty | Truncation/template mask may be broken\nAssistant answer but missing label token | Truncation or bad target formatting\n\nIf the supervised region is not the assistant answer only, I would not tune learning rate, epochs, rank, or vision layers yet. Fix the objective first.\n\n## Check 2: verify assistant-only / completion-only masking\n\nIf you can use prompt-completion form, prefer making the split explicit:\n\n\n example = {\n \"prompt\": [\n {\n \"role\": \"system\",\n \"content\": \"Classify stepladder use as safe or unsafe. 
Output JSON only.\"\n },\n {\n \"role\": \"user\",\n \"content\": [\n {\"type\": \"image\", \"image\": image},\n {\"type\": \"text\", \"text\": \"Inspect the ladder. Ladder bbox: [x1,y1,x2,y2].\"}\n ],\n },\n ],\n \"completion\": [\n {\n \"role\": \"assistant\",\n \"content\": '[{\"id\":\"0\",\"label\":\"unsafe\"}]'\n }\n ],\n }\n\n\nThe final label contract should be:\n\nToken region | Label\n---|---\nsystem prompt | `-100`\nuser text | `-100`\nimage tokens | `-100`\npad tokens | `-100`\nassistant JSON / class label | token IDs\n\nFor VLMs, you may need a custom collator rather than assuming the text-only assistant masking path works automatically. This is especially important because VLM processors and chat templates may go through a different path from ordinary text tokenizers.\n\nRelated resources:\n\n * TRL SFTTrainer docs — VLM support\n * TRL VLM full-sequence loss issue\n * HF Forum — assistant_only_loss=True and VLM/processor path confusion\n\n\n\n## Check 3: compare training vs inference chat rendering\n\nA second high-probability failure mode is that training and inference do not render the same chat contract.\n\nUseful references:\n\n * Transformers docs — Chat templates\n * Transformers docs — add_generation_prompt vs continue_final_message\n * Unsloth Gemma 4 Fine-tuning Guide\n * Unsloth chat templates docs\n * Google Gemma vision fine-tuning with Hugging Face\n\n\n\nPrint the exact rendered training string and inference string:\n\n\n train_text = processor.apply_chat_template(\n train_messages,\n tokenize=False,\n add_generation_prompt=False,\n )\n\n infer_text = processor.apply_chat_template(\n infer_messages_without_assistant,\n tokenize=False,\n add_generation_prompt=True,\n )\n\n print(\"===== TRAIN RENDERED =====\")\n print(train_text)\n print(\"===== INFER RENDERED =====\")\n print(infer_text)\n\n\nCheck:\n\n * same system prompt;\n * same user instruction;\n * same role markers;\n * image placeholder appears in the same position;\n * 
multimodal content order is consistent, usually image before text for Gemma-style multimodal prompts;\n * no duplicated BOS/EOS;\n * inference contains the correct assistant-start marker;\n * training does not accidentally include a generation prompt before the gold answer;\n * exported runtime uses the same chat template and EOS token.\n\n\n\nThis matters because chat models do not directly consume abstract Python dictionaries like:\n\n\n {\"role\": \"user\", \"content\": \"...\"}\n\n\nThey consume a rendered token sequence. If the rendered sequence differs, the model may be seeing a different task.\n\n## Check 4: do a nonce overfit to verify adapter/checkpoint/export\n\nIf this is LoRA/QLoRA, the fine-tuned behavior lives in the adapter unless it is correctly merged/exported.\n\nDo a tiny debug run:\n\n 1. create 4 examples;\n 2. add one impossible target label;\n 3. train briefly;\n 4. run inference on the exact same example.\n\n\n\nExample target:\n\n\n [{\"id\":\"0\",\"label\":\"DEBUG_TOKEN_7F3A\"}]\n\n\nInterpretation:\n\nResult | Meaning\n---|---\nModel emits `DEBUG_TOKEN_7F3A` in the same training environment | Adapter and training path probably work\nModel cannot emit the nonce even on the training sample | Adapter, labels, template, or training loop is suspect\nStudio/in-training inference emits nonce, exported model does not | Export/runtime/template/EOS issue\nBase and LoRA outputs are almost identical | Adapter may not be loaded or active\nMerged model differs from base+adapter | Merge/export path may be wrong\n\nUseful references:\n\n * TRL SFTTrainer docs — PEFT integration\n * Google Gemma QLoRA guide\n * Unsloth Gemma 4 Fine-tuning Guide\n * Medium — Fine-tuning Gemma 4 E2B step-by-step with Unsloth\n\n\n\n## Check 5: decode generated tokens only\n\nFor evaluation, do not decode prompt + generation together.\n\nUse generated-only decoding:\n\n\n outputs = model.generate(\n **inputs,\n max_new_tokens=32,\n do_sample=False,\n )\n\n prompt_len = 
inputs[\"input_ids\"].shape[1]\n generated_ids = outputs[:, prompt_len:]\n\n text = processor.batch_decode(\n generated_ids,\n skip_special_tokens=True,\n )[0]\n\n print(text)\n\n\nThen separate:\n\n * JSON parse success,\n * field/key extraction success,\n * label extraction success,\n * class distribution,\n * final accuracy.\n\n\n\nDo not default parse failures to `unsafe`.\n\nBad:\n\n\n if parse_failed:\n pred = \"unsafe\"\n\n\nBetter:\n\n\n if parse_failed:\n pred = \"PARSE_FAIL\"\n elif label not in {\"safe\", \"unsafe\"}:\n pred = \"INVALID_LABEL\"\n else:\n pred = label\n\n\nSuggested report:\n\n\n strict_json_parse_rate\n label_extraction_rate\n parse_fail_count\n invalid_label_count\n safe_count\n unsafe_count\n accuracy_on_parseable_outputs\n overall_accuracy\n\n\nA strong `unsafe` skew can be caused by model bias, but it can also be caused by parse-failure fallback.\n\nRelated resources:\n\n * HF Forum — VLM structured JSON/domain fine-tuning discussion\n * AWS — Fine-tune VLMs for multipage document-to-JSON\n * AWS sample repo — multimodal document-to-JSON with SageMaker AI\n\n\n\n## Check 6: temporarily remove JSON\n\nBefore debugging visual reasoning and JSON formatting at the same time, simplify the target:\n\n\n unsafe\n\n\nor:\n\n\n safe\n\n\nTiny-overfit test:\n\nTest | Meaning\n---|---\n4 examples, target only `safe`/`unsafe`, train-set accuracy near 100% | Basic adapter + visual/task path works\n`safe`/`unsafe` works, JSON fails | JSON formatting/parser/decode contract is the issue\n`safe`/`unsafe` also fails on 4 examples | Objective, adapter, template, or image input is still broken\nJSON parse fails but label appears in raw text | Parser/evaluator is too strict\nLabel is never generated | Training target or inference prompt likely wrong\n\nOnce this passes, reintroduce JSON:\n\n\n [{\"id\":\"0\",\"label\":\"unsafe\"}]\n\n\nIf JSON must be stable, consider prefill:\n\n\n [{\"id\":\"0\",\"label\":\"\n\n\nThen generate only the label 
continuation. In Transformers terminology, this is closer to continuing the final assistant message than starting a new assistant message, so be careful with `add_generation_prompt` vs `continue_final_message`.\n\nReference:\n\n * Transformers chat templates — generation prompts and continuing final messages\n\n\n\n## Check 7: use constrained or low-entropy decoding for classification\n\nFor debugging, use deterministic decoding:\n\n\n outputs = model.generate(\n **inputs,\n max_new_tokens=16,\n do_sample=False,\n )\n\n\nFor a binary task, you can also compare label token scores instead of free-form generation:\n\n\n # Conceptual sketch:\n # Prompt ends with: [{\"id\":\"0\",\"label\":\"\n # Compare next-token / next-string probability for \"safe\" vs \"unsafe\"\n\n\nThis removes:\n\n * sampling noise,\n * malformed JSON,\n * explanation text,\n * markdown fences,\n * run-on generation.\n\n\n\nIf logit comparison works but full JSON generation fails, the classification signal may be present but the output contract is unstable.\n\n## Check 8: only then investigate image/bbox design\n\nOnce tiny overfit, adapter loading, label masking, template rendering, and evaluation are proven correct, then test the visual side.\n\nCompare:\n\n 1. full image only;\n 2. ladder crop only;\n 3. full image + ladder crop;\n 4. full image with bbox drawn;\n 5. different resolutions / visual token budgets;\n 6. frozen vision layers vs vision LoRA;\n 7. 
language-only LoRA vs vision+language LoRA.\n\n\n\nFor a safety/bbox task, raw coordinate text may be less effective than giving the model either a crop or a visible marked region.\n\nUseful references:\n\n * Unsloth vision fine-tuning docs\n * Unsloth Gemma 4 Fine-tuning Guide\n * Google Gemma image understanding docs\n * Google Gemma vision fine-tuning with Hugging Face\n\n\n\n## Related examples where the general method works\n\nThis does not prove the exact stepladder dataset should work immediately, but it shows that the overall approach is valid when the data contract and evaluation contract are correct.\n\n### Structured image-to-JSON VLM fine-tuning\n\nAWS has a document-to-JSON VLM fine-tuning example and sample repo. Their repo reports that smaller models such as Qwen2.5-VL 3B can achieve high exact extraction accuracy on a document-to-JSON task after fine-tuning.\n\n * AWS blog — Fine-tune VLMs for multipage document-to-JSON with SageMaker AI and SWIFT\n * AWS sample repo — sample-for-multi-modal-document-to-json-with-sagemaker-ai\n\n\n\nThis is conceptually close to:\n\n\n image -> structured JSON\n\n\nYour task is:\n\n\n worksite image -> structured JSON label\n\n\nSo I would not conclude that “VLMs cannot do this”. I would first suspect the pipeline.\n\n### VLM SFT with TRL\n\nThere are multiple public VLM SFT recipes using TRL:\n\n * Hugging Face Cookbook — Fine-Tuning a Vision Language Model Qwen2-VL-7B with TRL\n * Hugging Face Cookbook — Fine-tuning SmolVLM with TRL\n * Phil Schmid — Fine-tune multimodal LLMs / VLMs with TRL\n * Daniel van Strien — Fine-tuning VLMs for Art History with TRL and HF Jobs\n * AMD ROCm tutorial — Fine-tuning Qwen2-VL-7B on ChartQA with LoRA\n\n\n\nThese examples are useful because they establish a baseline: VLM SFT itself is a normal workflow. 
If a model cannot overfit even 4 training examples, that is usually a contract/debug issue, not a reason to start with large hyperparameter sweeps.\n\n## Minimal debug sequence I would run\n\n### Phase A — freeze evidence\n\nRecord versions and runtime:\n\n\n import torch, transformers, trl, peft\n\n print(\"torch:\", torch.__version__)\n print(\"transformers:\", transformers.__version__)\n print(\"trl:\", trl.__version__)\n print(\"peft:\", peft.__version__)\n\n try:\n import unsloth\n print(\"unsloth:\", getattr(unsloth, \"__version__\", \"unknown\"))\n except Exception as e:\n print(\"unsloth import error:\", repr(e))\n\n\nAlso record:\n\n\n base model revision\n adapter checkpoint path\n export format\n processor/tokenizer path\n chat template\n EOS token\n PAD token\n image processor settings\n max_seq_length\n max_new_tokens\n do_sample\n\n\n### Phase B — 4-example overfit\n\nTrain on 4 examples.\n\nUse one target like:\n\n\n [{\"id\":\"0\",\"label\":\"DEBUG_TOKEN_7F3A\"}]\n\n\nExpected: exact training examples should be reproduced.\n\nIf this fails, stop and inspect adapter/template/labels.\n\n### Phase C — inspect batch labels\n\n\n batch = next(iter(trainer.get_train_dataloader()))\n\n input_ids = batch[\"input_ids\"][0]\n labels = batch[\"labels\"][0]\n mask = labels != -100\n\n print(tokenizer.decode(input_ids[mask], skip_special_tokens=False))\n\n\nExpected: only assistant answer.\n\nIf not, fix collator/objective.\n\n### Phase D — compare rendered templates\n\n\n train_text = processor.apply_chat_template(\n train_messages,\n tokenize=False,\n add_generation_prompt=False,\n )\n\n infer_text = processor.apply_chat_template(\n infer_messages_without_assistant,\n tokenize=False,\n add_generation_prompt=True,\n )\n\n print(train_text)\n print(infer_text)\n\n\nExpected: same task prefix, correct assistant generation start.\n\n### Phase E — generated-only evaluation\n\n\n outputs = model.generate(\n **inputs,\n max_new_tokens=32,\n do_sample=False,\n )\n\n 
prompt_len = inputs[\"input_ids\"].shape[1]\n generated_ids = outputs[:, prompt_len:]\n text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]\n\n\nThen report parse metrics separately from classification metrics.\n\n### Phase F — simplify output\n\nFirst train:\n\n\n unsafe\n\n\nnot:\n\n\n [{\"id\":\"0\",\"label\":\"unsafe\"}]\n\n\nOnce binary output works, reintroduce JSON.\n\n### Phase G — visual ablations\n\nOnly after A-F pass:\n\n\n full image\n crop image\n full + crop\n full with drawn bbox\n vision frozen\n vision LoRA\n language-only LoRA\n\n\n## Practical fixes depending on what fails\n\nFailed check | Likely fix\n---|---\n`labels != -100` includes prompt/user text | Use prompt-completion dataset, assistant-only/completion-only loss, or custom VLM collator\n`labels != -100` is empty or missing answer | Increase max length, fix truncation, check chat template generation mask\nBase and LoRA outputs identical | Verify adapter loading, checkpoint path, active adapter, merge/export\nTraining render and inference render differ | Use same processor/tokenizer/chat template/EOS; fix `add_generation_prompt`/prefill semantics\nGenerated text is valid-ish but parser fails | Make parser tolerant or use constrained decoding/prefill\nParse failures counted as `unsafe` | Add `PARSE_FAIL` class in evaluation\n`safe`/`unsafe` works but JSON fails | Keep classification simple, then add JSON prefill or constrained decoding\nTiny overfit works, full train still poor | Then inspect class imbalance, ambiguous labels, bbox/crop, resolution, and dataset quality\n\n## Things I would not change first\n\nI would not start with:\n\n * more epochs,\n * larger LoRA rank,\n * more data,\n * more complex system prompt,\n * vision layers everywhere,\n * higher resolution,\n * LR sweeps,\n * bigger model,\n\n\n\nuntil these four statements are true:\n\n 1. The adapter is active during inference.\n 2. `labels != -100` decodes only to the assistant answer.\n 3. 
Training and inference render the same chat task.\n 4. Evaluation decodes only generated tokens and does not default parse failures to `unsafe`.\n\n\n\nIf any of those are false, hyperparameter tuning can make the logs look different without fixing the underlying contract.\n\n## My likely diagnosis\n\nMy strongest hypothesis is:\n\n> The training loss is low because the model is optimizing an easier token objective than the intended stepladder safety decision, or because inference is not using the same adapter/template/EOS contract as training.\n\nThe closest public issue is the TRL VLM full-sequence-loss discussion:\n\n * huggingface/trl#3751\n\n\n\nThe most relevant official docs are:\n\n * TRL SFTTrainer\n * Transformers chat templates\n * Unsloth Gemma 4 Fine-tuning Guide\n\n\n\nThe shortest reliable path is:\n\n\n 4-example nonce overfit\n -> inspect labels != -100\n -> compare rendered chat templates\n -> generated-only decode\n -> binary safe/unsafe target\n -> JSON target\n -> bbox/crop/vision ablations\n\n\nIf the model cannot pass the 4-example nonce overfit with correct assistant-only labels, I would not consider the original accuracy number meaningful yet.",
"title": "VLM Fine-Tuning: Near-Zero Training Loss but Poor Inference Accuracy on Train Set (Gemma 4 E2B-it)"
}