VLM Fine-tuning: Near-Zero Training Loss but Poor Inference Accuracy on Train Set (Gemma 4 E2B It)
For now, there seem to be reports of similar cases:
I would debug this as a training/inference contract mismatch before treating it as a Gemma 4 E2B-it capability problem.
The combination of:
- near-zero training loss,
- poor inference even on training images,
- strong prediction skew toward one class,
- a very short target answer such as safe/unsafe or JSON,
- VLM SFT through a high-level wrapper/UI,
is exactly the kind of pattern where the scalar loss may be telling you the model learned some tokens, but not necessarily the task decision you care about.
My current guess would be:
The effective training target, effective inference prompt, and evaluation parser are probably not the same task contract.
The most likely causes, in order:
| Priority | Failure mode | Why it fits this symptom |
|---|---|---|
| 1 | Loss mask is wrong: loss is computed on the full rendered conversation, not only the assistant label/JSON | Long prompt/template tokens can drive loss near zero while the few safe/unsafe tokens remain poorly learned |
| 2 | Training and inference chat templates differ | VLM/chat models are sensitive to role markers, image placeholders, EOS, and assistant-start tokens |
| 3 | LoRA adapter/checkpoint/export is not actually used at inference | Training loss can be real, while inference accidentally uses base model behavior |
| 4 | Evaluation/parsing bug | Parse failures or prompt-echoes can be misread as unsafe, creating artificial class skew |
| 5 | Image/bbox/crop issue | Possible, but I would check this only after the tiny-overfit and masking tests pass |
| 6 | Gemma 4 / Unsloth / Transformers / TRL version issue | Possible, but less useful to assume before inspecting the actual batch labels and rendered prompt |
Why I would not trust the near-zero loss yet
For this task, the assistant answer is tiny:
[{"id":"0","label":"unsafe"}]
But the rendered training sequence may contain:
- system prompt,
- user instruction,
- image placeholder tokens,
- bbox text,
- formatting/control tokens,
- assistant JSON answer.
If the trainer computes loss on the whole rendered sequence, the model can reduce loss mostly by learning deterministic prompt/template tokens. The actual classification decision may be only a few tokens out of the whole sequence.
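To make the dilution effect concrete, here is a small arithmetic sketch. All token counts and per-token losses are invented for illustration (they are assumptions, not measurements from this run): a long, almost perfectly predictable prompt averaged together with two decision tokens that are still at coin-flip level.

```python
# Illustrative numbers only: every count and loss value here is an assumption.
def mean_loss(per_token_losses):
    """Average cross-entropy over the tokens that carry labels."""
    return sum(per_token_losses) / len(per_token_losses)

# Hypothetical sequence: 400 prompt/template tokens that are nearly deterministic,
# 10 JSON scaffolding tokens, and 2 tokens that carry the safe/unsafe decision.
prompt_losses = [0.005] * 400
json_scaffold_losses = [0.01] * 10
decision_losses = [0.69] * 2  # roughly ln(2): a coin flip between two labels

full_sequence = mean_loss(prompt_losses + json_scaffold_losses + decision_losses)
assistant_only = mean_loss(json_scaffold_losses + decision_losses)
decision_only = mean_loss(decision_losses)

print(f"full-sequence loss:  {full_sequence:.4f}")
print(f"assistant-only loss: {assistant_only:.4f}")
print(f"decision-token loss: {decision_only:.4f}")
```

Under these assumed numbers the full-sequence average lands below 0.01 even though the decision tokens are no better than chance, which is the pattern a near-zero logged loss can hide.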
This is a known issue class in TRL/VLM SFT:
- TRL SFTTrainer docs
- TRL issue #3751 — VLM SFT example computes loss for the entire sequence, including prompt/user content
- TRL issue #5471 — assistant_only_loss=True requires {% generation %} / {% endgeneration %} markers
- TRL issue #3781 — assistant_only_loss=True silently ignored with use_liger_kernel=True
- HF Forum — SFTTrainer loss function and formatting_func
- HF Forum — SFTTrainer works but without result
The key TRL doc detail is that completion_only_loss and assistant_only_loss are separate from ordinary full-sequence language-modeling loss. For prompt-completion datasets, completion-only loss can supervise only the completion. For conversational assistant-only training, the chat template must be able to return assistant/generation masks.
So the first question is not “why is loss low?” but:
Which tokens actually have labels other than -100?
Check 1: inspect the real supervised tokens
This is the single most important test.
batch = next(iter(trainer.get_train_dataloader()))
input_ids = batch["input_ids"][0]
labels = batch["labels"][0]
mask = labels != -100
print("input length:", input_ids.numel())
print("supervised token count:", mask.sum().item())
print(tokenizer.decode(input_ids[mask], skip_special_tokens=False))
Expected output should be close to only the assistant target:
[{"id":"0","label":"unsafe"}]
Bad output:
You are a safety vision model...
Inspect the stepladder...
Ladder bbox: ...
[{"id":"0","label":"unsafe"}]
Very bad output:
<image> <pad> <eos>
or almost no supervised tokens.
Interpretation:
| What labels != -100 decodes to | Interpretation |
|---|---|
| Only assistant JSON / only safe or unsafe | Loss target is probably OK |
| System/user prompt + assistant answer | Loss is probably diluted by prompt/template tokens |
| Image/pad/special tokens | Collator/token masking is likely wrong |
| Empty or almost empty | Truncation/template mask may be broken |
| Assistant answer but missing label token | Truncation or bad target formatting |
If the supervised region is not the assistant answer only, I would not tune learning rate, epochs, rank, or vision layers yet. Fix the objective first.
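Decoding the supervised tokens shows what is supervised; it can also help to see where the supervised region sits and how large it is. The following is a minimal pure-Python sketch that works on a plain list of label IDs. The helper names supervised_spans and summarize are hypothetical, not part of TRL or Transformers.

```python
IGNORE_INDEX = -100  # label value that the loss ignores


def supervised_spans(labels):
    """Return (start, end) pairs, end exclusive, for contiguous supervised runs."""
    spans, start = [], None
    for i, label in enumerate(labels):
        if label != IGNORE_INDEX and start is None:
            start = i
        elif label == IGNORE_INDEX and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans


def summarize(labels):
    """Count supervised tokens and report where they are located."""
    spans = supervised_spans(labels)
    supervised = sum(end - start for start, end in spans)
    return {
        "total_tokens": len(labels),
        "supervised_tokens": supervised,
        "supervised_fraction": supervised / len(labels) if labels else 0.0,
        "spans": spans,
    }


# Made-up label row: 6 masked prompt tokens followed by a 3-token answer.
example_labels = [-100] * 6 + [101, 102, 103]
print(summarize(example_labels))
```

With a real batch you would pass something like batch["labels"][0].tolist(). A healthy result is usually a single short span near the end of the sequence; many spans, or a fraction close to 1.0, suggests prompt tokens are being supervised.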
Check 2: verify assistant-only / completion-only masking
If you can use prompt-completion form, prefer making the split explicit:
example = {
    "prompt": [
        {
            "role": "system",
            "content": "Classify stepladder use as safe or unsafe. Output JSON only."
        },
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": "Inspect the ladder. Ladder bbox: [x1,y1,x2,y2]."}
            ],
        },
    ],
    "completion": [
        {
            "role": "assistant",
            "content": '[{"id":"0","label":"unsafe"}]'
        }
    ],
}
The final label contract should be:
| Token region | Label |
|---|---|
| system prompt | -100 |
| user text | -100 |
| image tokens | -100 |
| pad tokens | -100 |
| assistant JSON / class label | token IDs |
For VLMs, you may need a custom collator rather than assuming the text-only assistant masking path works automatically. This is especially important because VLM processors and chat templates may go through a different path from ordinary text tokenizers.
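As a sketch of the masking rule in the table above, the core of such a collator can be reduced to one function over plain token-ID lists. This is an illustration under assumptions, not the TRL or Unsloth implementation: build_labels is a hypothetical helper, the token IDs are invented, and locating answer_start (for example by tokenizing the prompt alone and taking its length) is left to the caller.

```python
IGNORE_INDEX = -100  # label value that the loss ignores


def build_labels(input_ids, answer_start, pad_token_id, image_token_ids=()):
    """Copy input_ids into labels, masking everything except the assistant answer.

    answer_start is the index of the first assistant-answer token. How it is
    found is an assumption of this sketch and depends on your template.
    """
    blocked = set(image_token_ids) | {pad_token_id}
    labels = []
    for position, token_id in enumerate(input_ids):
        if position < answer_start or token_id in blocked:
            labels.append(IGNORE_INDEX)  # prompt, image, or pad token
        else:
            labels.append(token_id)  # supervised assistant token
    return labels


# Hypothetical token IDs: 0 = pad, 9 = image placeholder, the rest arbitrary.
input_ids = [5, 9, 9, 7, 8, 21, 22, 23, 0, 0]
labels = build_labels(input_ids, answer_start=5, pad_token_id=0, image_token_ids=[9])
print(labels)
```

One caveat: if the pad token and the EOS token share an ID, masking by pad ID also hides the EOS after the answer, so the model may never learn to stop. In that case masking padding via the attention mask is likely the safer rule.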
Related resources:
- TRL SFTTrainer docs — VLM support
- TRL VLM full-sequence loss issue
- HF Forum — assistant_only_loss=True and VLM/processor path confusion
Check 3: compare training vs inference chat rendering
A second high-probability failure mode is that training and inference do not render the same chat contract.
Useful references:
- Transformers docs — Chat templates
- Transformers docs — add_generation_prompt vs continue_final_message
- Unsloth Gemma 4 Fine-tuning Guide
- Unsloth chat templates docs
- Google Gemma vision fine-tuning with Hugging Face
Print the exact rendered training string and inference string:
train_text = processor.apply_chat_template(
    train_messages,
    tokenize=False,
    add_generation_prompt=False,
)
infer_text = processor.apply_chat_template(
    infer_messages_without_assistant,
    tokenize=False,
    add_generation_prompt=True,
)
print("===== TRAIN RENDERED =====")
print(train_text)
print("===== INFER RENDERED =====")
print(infer_text)
Check:
- same system prompt;
- same user instruction;
- same role markers;
- image placeholder appears in the same position;
- multimodal content order is consistent, usually image before text for Gemma-style multimodal prompts;
- no duplicated BOS/EOS;
- inference contains the correct assistant-start marker;
- training does not accidentally include a generation prompt before the gold answer;
- exported runtime uses the same chat template and EOS token.
This matters because chat models do not directly consume abstract Python dictionaries like:
{"role": "user", "content": "..."}
They consume a rendered token sequence. If the rendered sequence differs, the model may be seeing a different task.
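One mechanical way to compare the two renderings is a prefix check: with many chat templates, the inference prompt rendered with add_generation_prompt=True should be an exact prefix of the training text. That is an assumption and may not hold for every template, so treat a mismatch as a pointer to inspect, not as proof of a bug. The role markers in the example strings are placeholders, not the real Gemma template, and the helper names are hypothetical.

```python
def first_divergence(train_text, infer_text):
    """Return the index where infer_text stops being a prefix of train_text, or None."""
    for i, ch in enumerate(infer_text):
        if i >= len(train_text) or train_text[i] != ch:
            return i
    return None


def check_prefix(train_text, infer_text, context=30):
    """Describe the first mismatch with some surrounding characters."""
    idx = first_divergence(train_text, infer_text)
    if idx is None:
        return "OK: inference prompt is a prefix of the training text"
    lo = max(0, idx - context)
    return (
        f"MISMATCH at char {idx}\n"
        f"  train: {train_text[lo:idx + context]!r}\n"
        f"  infer: {infer_text[lo:idx + context]!r}"
    )


# Placeholder markers for illustration only.
train = "<|user|>[IMG]Inspect the ladder.<|assistant|>unsafe<|end|>"
infer_ok = "<|user|>[IMG]Inspect the ladder.<|assistant|>"
infer_bad = "<|user|>Inspect the ladder.[IMG]<|assistant|>"  # image after text

print(check_prefix(train, infer_ok))
print(check_prefix(train, infer_bad))
```

Using repr in the report makes invisible differences such as a stray newline or a duplicated BOS marker visible.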
Check 4: do a nonce overfit to verify adapter/checkpoint/export
If this is LoRA/QLoRA, the fine-tuned behavior lives in the adapter unless it is correctly merged/exported.
Do a tiny debug run:
- create 4 examples;
- add one impossible target label;
- train briefly;
- run inference on the exact same example.
Example target:
[{"id":"0","label":"DEBUG_TOKEN_7F3A"}]
Interpretation:
| Result | Meaning |
|---|---|
| Model emits DEBUG_TOKEN_7F3A in the same training environment | Adapter and training path probably work |
| Model cannot emit the nonce even on the training sample | Adapter, labels, template, or training loop is suspect |
| Studio/in-training inference emits nonce, exported model does not | Export/runtime/template/EOS issue |
| Base and LoRA outputs are almost identical | Adapter may not be loaded or active |
| Merged model differs from base+adapter | Merge/export path may be wrong |
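The table can be turned into a small triage helper once you have raw generations from the base model, the adapter-loaded model, and optionally the exported model for the same training example. This is a heuristic sketch: diagnose_nonce is a hypothetical function, it treats "almost identical" as exact equality after stripping whitespace, and it does not cover the merged-versus-adapter row.

```python
def diagnose_nonce(nonce, base_out, adapter_out, exported_out=None):
    """Map raw generations for one training example to a likely diagnosis."""
    if nonce in base_out:
        # The base model should never produce the nonce on its own.
        return "invalid test: base model already emits the nonce"
    if base_out.strip() == adapter_out.strip():
        return "adapter may not be loaded or active"
    if nonce not in adapter_out:
        return "adapter, labels, template, or training loop is suspect"
    if exported_out is not None and nonce not in exported_out:
        return "export/runtime/template/EOS issue"
    return "adapter and training path probably work"


NONCE = "DEBUG_TOKEN_7F3A"
base = '[{"id":"0","label":"safe"}]'
tuned = '[{"id":"0","label":"DEBUG_TOKEN_7F3A"}]'

print(diagnose_nonce(NONCE, base, tuned))
print(diagnose_nonce(NONCE, base, tuned, exported_out=base))
print(diagnose_nonce(NONCE, base, base))
```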
Useful references:
- TRL SFTTrainer docs — PEFT integration
- Google Gemma QLoRA guide
- Unsloth Gemma 4 Fine-tuning Guide
- Medium — Fine-tuning Gemma 4 E2B step-by-step with Unsloth
Check 5: decode generated tokens only
For evaluation, do not decode prompt + generation together.
Use generated-only decoding:
outputs = model.generate(
    **inputs,
    max_new_tokens=32,
    do_sample=False,
)
prompt_len = inputs["input_ids"].shape[1]
generated_ids = outputs[:, prompt_len:]
text = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)[0]
print(text)
Then separate:
- JSON parse success,
- field/key extraction success,
- label extraction success,
- class distribution,
- final accuracy.
Do not default parse failures to unsafe.
Bad:
if parse_failed:
    pred = "unsafe"
Better:
if parse_failed:
    pred = "PARSE_FAIL"
elif label not in {"safe", "unsafe"}:
    pred = "INVALID_LABEL"
else:
    pred = label
Suggested report:
strict_json_parse_rate
label_extraction_rate
parse_fail_count
invalid_label_count
safe_count
unsafe_count
accuracy_on_parseable_outputs
overall_accuracy
A strong unsafe skew can be caused by model bias, but it can also be caused by parse-failure fallback.
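A minimal sketch of an evaluator that produces the report above, using only the standard library. It assumes the target shape [{"id": ..., "label": ...}] from this thread and reads "parseable" in accuracy_on_parseable_outputs as "a valid label was extracted". The function names are hypothetical.

```python
import json
from collections import Counter

VALID_LABELS = {"safe", "unsafe"}


def classify_output(text):
    """Return 'safe', 'unsafe', 'PARSE_FAIL', or 'INVALID_LABEL' for one generation."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return "PARSE_FAIL"
    # Expected shape: a non-empty list whose first item is a dict with a label.
    if not (isinstance(parsed, list) and parsed and isinstance(parsed[0], dict)):
        return "INVALID_LABEL"
    label = parsed[0].get("label")
    if isinstance(label, str) and label in VALID_LABELS:
        return label
    return "INVALID_LABEL"


def build_report(generations, gold_labels):
    """Keep parse metrics separate from classification metrics."""
    if not generations:
        raise ValueError("no generations to evaluate")
    preds = [classify_output(g) for g in generations]
    counts = Counter(preds)
    n = len(preds)
    labelled = [(p, g) for p, g in zip(preds, gold_labels) if p in VALID_LABELS]
    correct = sum(p == g for p, g in labelled)
    return {
        "strict_json_parse_rate": (n - counts["PARSE_FAIL"]) / n,
        "label_extraction_rate": len(labelled) / n,
        "parse_fail_count": counts["PARSE_FAIL"],
        "invalid_label_count": counts["INVALID_LABEL"],
        "safe_count": counts["safe"],
        "unsafe_count": counts["unsafe"],
        "accuracy_on_parseable_outputs": correct / len(labelled) if labelled else 0.0,
        "overall_accuracy": correct / n,
    }


# Invented generations for illustration.
generations = [
    '[{"id":"0","label":"safe"}]',
    '[{"id":"0","label":"unsafe"}]',
    "The ladder looks unsafe.",
    '[{"id":"0","label":"maybe"}]',
]
gold = ["safe", "safe", "unsafe", "unsafe"]
report = build_report(generations, gold)
print(report)
```

Parse failures and invalid labels count as wrong in overall_accuracy but never inflate unsafe_count, so a skew in the class counts reflects what the model actually emitted.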
Related resources:
- HF Forum — VLM structured JSON/domain fine-tuning discussion
- AWS — Fine-tune VLMs for multipage document-to-JSON
- AWS sample repo — multimodal document-to-JSON with SageMaker AI
Check 6: temporarily remove JSON
Before debugging visual reasoning and JSON formatting at the same time, simplify the target:
unsafe
or:
safe
Tiny-overfit test:
| Test | Meaning |
|---|---|
| 4 examples, target only safe/unsafe, train-set accuracy near 100% | Basic adapter + visual/task path works |
| safe/unsafe works, JSON fails | JSON formatting/parser/decode contract is the issue |
| safe/unsafe also fails on 4 examples | Objective, adapter, template, or image input is still broken |
| JSON parse fails but label appears in raw text | Parser/evaluator is too strict |
| Label is never generated | Training target or inference prompt likely wrong |
Once this passes, reintroduce JSON:
[{"id":"0","label":"unsafe"}]
If JSON must be stable, consider prefill:
[{"id":"0","label":"
Then generate only the label continuation. In Transformers terminology, this is closer to continuing the final assistant message than starting a new assistant message, so be careful with add_generation_prompt vs continue_final_message.
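On the post-processing side, the prefill approach means the model returns only a continuation, so the evaluator has to cut the label out and rebuild the JSON itself. A minimal sketch, assuming the continuation starts with the label and that anything after the first closing quote can be discarded; complete_from_prefill is a hypothetical helper.

```python
import json

PREFILL = '[{"id":"0","label":"'


def complete_from_prefill(continuation, prefill=PREFILL):
    """Extract the label from a continuation and rebuild the full JSON string."""
    # Keep everything up to the first closing quote; ignore whatever follows.
    label = continuation.split('"', 1)[0].strip()
    rebuilt = f'{prefill}{label}"}}]'
    return label, rebuilt


label, rebuilt = complete_from_prefill('unsafe"}]')
print(label)
print(json.loads(rebuilt))
```

The extracted label should still be validated against the allowed set, since the model can continue the prefill with something other than safe or unsafe.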
Reference:
- Transformers chat templates — generation prompts and continuing final messages
Check 7: use constrained or low-entropy decoding for classification
For debugging, use deterministic decoding:
outputs = model.generate(
    **inputs,
    max_new_tokens=16,
    do_sample=False,
)
For a binary task, you can also compare label token scores instead of free-form generation:
# Conceptual sketch:
# Prompt ends with: [{"id":"0","label":"
# Compare next-token / next-string probability for "safe" vs "unsafe"
This removes:
- sampling noise,
- malformed JSON,
- explanation text,
- markdown fences,
- run-on generation.
If logit comparison works but full JSON generation fails, the classification signal may be present but the output contract is unstable.
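The comparison step itself can be sketched without any model code. Because safe and unsafe may split into different numbers of tokens, it is usually better to compare the summed log-probability of each full label string than a single next-token logit. The sketch below assumes you have already obtained per-token log-probabilities for each candidate continuation; how you get them from your runtime is not shown, the numbers are invented, and pick_label is a hypothetical helper.

```python
import math


def pick_label(token_logprobs_by_label):
    """Choose the label whose full token sequence has the highest total log-prob.

    token_logprobs_by_label maps each candidate label to the per-token
    log-probabilities the model assigned to that label's tokens after the prompt.
    """
    totals = {label: sum(lps) for label, lps in token_logprobs_by_label.items()}
    # Softmax over the candidate set only, shifted by the max for stability.
    max_total = max(totals.values())
    shifted = {label: math.exp(t - max_total) for label, t in totals.items()}
    z = sum(shifted.values())
    probs = {label: v / z for label, v in shifted.items()}
    best = max(probs, key=probs.get)
    return best, probs


# Invented log-probs; "unsafe" is assumed to split into two tokens here.
scores = {"safe": [-1.2], "unsafe": [-0.9, -0.1]}
best, probs = pick_label(scores)
print(best, probs)
```

The normalized probabilities are relative to the two candidates only, which also gives a usable confidence score for threshold tuning on an imbalanced safety task.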
Check 8: only then investigate image/bbox design
Once tiny overfit, adapter loading, label masking, template rendering, and evaluation are proven correct, then test the visual side.
Compare:
- full image only;
- ladder crop only;
- full image + ladder crop;
- full image with bbox drawn;
- different resolutions / visual token budgets;
- frozen vision layers vs vision LoRA;
- language-only LoRA vs vision+language LoRA.
For a safety/bbox task, raw coordinate text may be less effective than giving the model either a crop or a visible marked region.
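For the crop variant, a small helper that expands the ladder bbox by a margin and clamps it to the image bounds keeps some surrounding context (floor, wall, person) in view. This is a sketch under assumptions: the box is (x1, y1, x2, y2) in pixels, and both the function name and the padding ratio are illustrative choices.

```python
import math


def padded_crop_box(bbox, image_size, pad_ratio=0.1):
    """Expand an (x1, y1, x2, y2) box by pad_ratio per side and clamp to the image."""
    x1, y1, x2, y2 = bbox
    width, height = image_size
    pad_x = (x2 - x1) * pad_ratio
    pad_y = (y2 - y1) * pad_ratio
    return (
        max(0, math.floor(x1 - pad_x)),
        max(0, math.floor(y1 - pad_y)),
        min(width, math.ceil(x2 + pad_x)),
        min(height, math.ceil(y2 + pad_y)),
    )


# Hypothetical ladder box in a 640x480 image, padded by 25% per side.
box = padded_crop_box((100, 50, 300, 450), image_size=(640, 480), pad_ratio=0.25)
print(box)
```

The result uses the (left, upper, right, lower) order that Pillow's Image.crop expects, so it can be passed directly as image.crop(box) if you load images with Pillow.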
Useful references:
- Unsloth vision fine-tuning docs
- Unsloth Gemma 4 Fine-tuning Guide
- Google Gemma image understanding docs
- Google Gemma vision fine-tuning with Hugging Face
Related examples where the general method works
This does not prove the exact stepladder dataset should work immediately, but it shows that the overall approach is valid when the data contract and evaluation contract are correct.
Structured image-to-JSON VLM fine-tuning
AWS has a document-to-JSON VLM fine-tuning example and sample repo. Their repo reports that smaller models such as Qwen2.5-VL 3B can achieve high exact extraction accuracy on a document-to-JSON task after fine-tuning.
- AWS blog — Fine-tune VLMs for multipage document-to-JSON with SageMaker AI and SWIFT
- AWS sample repo — sample-for-multi-modal-document-to-json-with-sagemaker-ai
This is conceptually close to:
image -> structured JSON
Your task is:
worksite image -> structured JSON label
So I would not conclude that “VLMs cannot do this”. I would first suspect the pipeline.
VLM SFT with TRL
There are multiple public VLM SFT recipes using TRL:
- Hugging Face Cookbook — Fine-Tuning a Vision Language Model Qwen2-VL-7B with TRL
- Hugging Face Cookbook — Fine-tuning SmolVLM with TRL
- Phil Schmid — Fine-tune multimodal LLMs / VLMs with TRL
- Daniel van Strien — Fine-tuning VLMs for Art History with TRL and HF Jobs
- AMD ROCm tutorial — Fine-tuning Qwen2-VL-7B on ChartQA with LoRA
These examples are useful because they establish a baseline: VLM SFT itself is a normal workflow. If a model cannot overfit even 4 training examples, that is usually a contract/debug issue, not a reason to start with large hyperparameter sweeps.
Minimal debug sequence I would run
Phase A — freeze evidence
Record versions and runtime:
import torch, transformers, trl, peft
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("trl:", trl.__version__)
print("peft:", peft.__version__)
try:
    import unsloth
    print("unsloth:", getattr(unsloth, "__version__", "unknown"))
except Exception as e:
    print("unsloth import error:", repr(e))
Also record:
base model revision
adapter checkpoint path
export format
processor/tokenizer path
chat template
EOS token
PAD token
image processor settings
max_seq_length
max_new_tokens
do_sample
Phase B — 4-example overfit
Train on 4 examples.
Use one target like:
[{"id":"0","label":"DEBUG_TOKEN_7F3A"}]
Expected: exact training examples should be reproduced.
If this fails, stop and inspect adapter/template/labels.
Phase C — inspect batch labels
batch = next(iter(trainer.get_train_dataloader()))
input_ids = batch["input_ids"][0]
labels = batch["labels"][0]
mask = labels != -100
print(tokenizer.decode(input_ids[mask], skip_special_tokens=False))
Expected: only assistant answer.
If not, fix collator/objective.
Phase D — compare rendered templates
train_text = processor.apply_chat_template(
    train_messages,
    tokenize=False,
    add_generation_prompt=False,
)
infer_text = processor.apply_chat_template(
    infer_messages_without_assistant,
    tokenize=False,
    add_generation_prompt=True,
)
print(train_text)
print(infer_text)
Expected: same task prefix, correct assistant generation start.
Phase E — generated-only evaluation
outputs = model.generate(
    **inputs,
    max_new_tokens=32,
    do_sample=False,
)
prompt_len = inputs["input_ids"].shape[1]
generated_ids = outputs[:, prompt_len:]
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
Then report parse metrics separately from classification metrics.
Phase F — simplify output
First train:
unsafe
not:
[{"id":"0","label":"unsafe"}]
Once binary output works, reintroduce JSON.
Phase G — visual ablations
Only after A-F pass:
full image
crop image
full + crop
full with drawn bbox
vision frozen
vision LoRA
language-only LoRA
Practical fixes depending on what fails
| Failed check | Likely fix |
|---|---|
| labels != -100 includes prompt/user text | Use prompt-completion dataset, assistant-only/completion-only loss, or custom VLM collator |
| labels != -100 is empty or missing answer | Increase max length, fix truncation, check chat template generation mask |
| Base and LoRA outputs identical | Verify adapter loading, checkpoint path, active adapter, merge/export |
| Training render and inference render differ | Use same processor/tokenizer/chat template/EOS; fix add_generation_prompt/prefill semantics |
| Generated text is valid-ish but parser fails | Make parser tolerant or use constrained decoding/prefill |
| Parse failures counted as unsafe | Add PARSE_FAIL class in evaluation |
| safe/unsafe works but JSON fails | Keep classification simple, then add JSON prefill or constrained decoding |
| Tiny overfit works, full train still poor | Then inspect class imbalance, ambiguous labels, bbox/crop, resolution, and dataset quality |
Things I would not change first
I would not start with:
- more epochs,
- larger LoRA rank,
- more data,
- more complex system prompt,
- vision layers everywhere,
- higher resolution,
- LR sweeps,
- bigger model,
until these four statements are true:
- The adapter is active during inference.
- labels != -100 decodes only to the assistant answer.
- Training and inference render the same chat task.
- Evaluation decodes only generated tokens and does not default parse failures to unsafe.
If any of those are false, hyperparameter tuning can make the logs look different without fixing the underlying contract.
My likely diagnosis
My strongest hypothesis is:
The training loss is low because the model is optimizing an easier token objective than the intended stepladder safety decision, or because inference is not using the same adapter/template/EOS contract as training.
The closest public issue is the TRL VLM full-sequence-loss discussion:
- huggingface/trl#3751
The most relevant official docs are:
- TRL SFTTrainer
- Transformers chat templates
- Unsloth Gemma 4 Fine-tuning Guide
The shortest reliable path is:
4-example nonce overfit
-> inspect labels != -100
-> compare rendered chat templates
-> generated-only decode
-> binary safe/unsafe target
-> JSON target
-> bbox/crop/vision ablations
If the model cannot pass the 4-example nonce overfit with correct assistant-only labels, I would not consider the original accuracy number meaningful yet.