
VLM Fine tuning: Near-Zero Training Loss but Poor Inference Accuracy on Train Set (Gemma 4 E2B It)

Hugging Face Forums [Unofficial] May 26, 2026

For now, there seem to be reports of similar cases:


I would debug this as a training/inference contract mismatch before treating it as a Gemma 4 E2B-it capability problem.

The combination of:

  • near-zero training loss,
  • poor inference even on training images,
  • strong prediction skew toward one class,
  • a very short target answer such as safe / unsafe or JSON,
  • VLM SFT through a high-level wrapper/UI,

is exactly the kind of pattern where the scalar loss may be telling you the model learned some tokens, but not necessarily the task decision you care about.

My current guess would be:

The effective training target, effective inference prompt, and evaluation parser are probably not the same task contract.

The most likely causes, in order:

| Priority | Failure mode | Why it fits this symptom |
|---|---|---|
| 1 | Loss mask is wrong: loss is computed on the full rendered conversation, not only the assistant label/JSON | Long prompt/template tokens can drive loss near zero while the few safe/unsafe tokens remain poorly learned |
| 2 | Training and inference chat templates differ | VLM/chat models are sensitive to role markers, image placeholders, EOS, and assistant-start tokens |
| 3 | LoRA adapter/checkpoint/export is not actually used at inference | Training loss can be real, while inference accidentally uses base model behavior |
| 4 | Evaluation/parsing bug | Parse failures or prompt-echoes can be misread as unsafe, creating artificial class skew |
| 5 | Image/bbox/crop issue | Possible, but I would check this only after the tiny-overfit and masking tests pass |
| 6 | Gemma 4 / Unsloth / Transformers / TRL version issue | Possible, but less useful to assume before inspecting the actual batch labels and rendered prompt |

Why I would not trust the near-zero loss yet

For this task, the assistant answer is tiny:

[{"id":"0","label":"unsafe"}]

But the rendered training sequence may contain:

  • system prompt,
  • user instruction,
  • image placeholder tokens,
  • bbox text,
  • formatting/control tokens,
  • assistant JSON answer.

If the trainer computes loss on the whole rendered sequence, the model can reduce loss mostly by learning deterministic prompt/template tokens. The actual classification decision may be only a few tokens out of the whole sequence.
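A rough arithmetic sketch of that dilution effect. All token counts and per-token losses below are made-up illustration values, not measurements from this run, and `mean_token_loss` is just a helper name for the example.

```python
def mean_token_loss(per_token_losses):
    """Average cross-entropy over the tokens that carry a label."""
    return sum(per_token_losses) / len(per_token_losses)


# Hypothetical numbers, chosen only to illustrate the effect.
prompt_losses = [0.01] * 400  # deterministic prompt/template tokens, almost memorized
json_losses = [0.01] * 10     # fixed JSON scaffolding around the label
label_losses = [0.69] * 2     # the safe/unsafe decision, close to a coin flip (ln 2 ~ 0.693)

full_sequence = mean_token_loss(prompt_losses + json_losses + label_losses)
assistant_only = mean_token_loss(json_losses + label_losses)
label_only = mean_token_loss(label_losses)

print(f"full-sequence loss:  {full_sequence:.4f}")
print(f"assistant-only loss: {assistant_only:.4f}")
print(f"label-token loss:    {label_only:.4f}")
```

With these assumed numbers the full-sequence average comes out around 0.013, which looks converged, while the two label tokens are still near chance. That is the situation a full-sequence loss can hide.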

This is a known issue class in TRL/VLM SFT:

  • TRL SFTTrainer docs
  • TRL issue #3751 — VLM SFT example computes loss for the entire sequence, including prompt/user content
  • TRL issue #5471 — assistant_only_loss=True requires {% generation %} / {% endgeneration %} markers
  • TRL issue #3781 — assistant_only_loss=True silently ignored with use_liger_kernel=True
  • HF Forum — SFTTrainer loss function and formatting_func
  • HF Forum — SFTTrainer works but without result

The key TRL doc detail is that completion_only_loss and assistant_only_loss are separate from ordinary full-sequence language-modeling loss. For prompt-completion datasets, completion-only loss can supervise only the completion. For conversational assistant-only training, the chat template must be able to return assistant/generation masks.

So the first question is not “why is loss low?” but:

Which tokens actually have labels other than -100?

Check 1: inspect the real supervised tokens

This is the single most important test.

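# Pull one real batch exactly as the trainer will see it.
batch = next(iter(trainer.get_train_dataloader()))

input_ids = batch["input_ids"][0]
labels = batch["labels"][0]

# Positions labeled -100 are ignored by the loss; everything else is supervised.
mask = labels != -100

print("input length:", input_ids.numel())
print("supervised token count:", mask.sum().item())
# For a VLM, `tokenizer` is typically processor.tokenizer.
print(tokenizer.decode(input_ids[mask], skip_special_tokens=False))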

Expected output should be close to only the assistant target:

[{"id":"0","label":"unsafe"}]

Bad output:

You are a safety vision model...
Inspect the stepladder...
Ladder bbox: ...
[{"id":"0","label":"unsafe"}]

Very bad output:

<image> <pad> <eos>

or almost no supervised tokens.

Interpretation:

| What labels != -100 decodes to | Interpretation |
|---|---|
| Only assistant JSON / only safe or unsafe | Loss target is probably OK |
| System/user prompt + assistant answer | Loss is probably diluted by prompt/template tokens |
| Image/pad/special tokens | Collator/token masking is likely wrong |
| Empty or almost empty | Truncation/template mask may be broken |
| Assistant answer but missing label token | Truncation or bad target formatting |

If the supervised region is not the assistant answer only, I would not tune learning rate, epochs, rank, or vision layers yet. Fix the objective first.
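To make that inspection repeatable across several batches, a small helper can summarize where the supervised tokens sit. This is a minimal sketch on plain Python lists. The function names are illustrative, and it assumes you have converted a label row with something like `batch["labels"][0].tolist()`.

```python
def supervised_spans(labels, ignore_index=-100):
    """Return (start, end) pairs, end exclusive, for each run of supervised tokens."""
    spans = []
    start = None
    for i, label in enumerate(labels):
        if label != ignore_index and start is None:
            start = i  # a supervised run begins here
        elif label == ignore_index and start is not None:
            spans.append((start, i))  # the run ended just before position i
            start = None
    if start is not None:
        spans.append((start, len(labels)))  # run extends to the end of the sequence
    return spans


def supervised_fraction(labels, ignore_index=-100):
    """Share of positions that actually contribute to the loss."""
    if not labels:
        return 0.0
    return sum(1 for label in labels if label != ignore_index) / len(labels)


example_labels = [-100] * 5 + [11, 12, 13] + [-100] * 2
print(supervised_spans(example_labels))
print(supervised_fraction(example_labels))
```

For a short JSON target you would expect one short span near the end of the sequence and a small fraction. A single span that starts at position 0, or several scattered spans, suggests the mask is not doing what you intended.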

Check 2: verify assistant-only / completion-only masking

If you can use prompt-completion form, prefer making the split explicit:

example = {
    "prompt": [
        {
            "role": "system",
            "content": "Classify stepladder use as safe or unsafe. Output JSON only."
        },
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": "Inspect the ladder. Ladder bbox: [x1,y1,x2,y2]."}
            ],
        },
    ],
    "completion": [
        {
            "role": "assistant",
            "content": '[{"id":"0","label":"unsafe"}]'
        }
    ],
}

The final label contract should be:

| Token region | Label |
|---|---|
| system prompt | -100 |
| user text | -100 |
| image tokens | -100 |
| pad tokens | -100 |
| assistant JSON / class label | token IDs |

For VLMs, you may need a custom collator rather than assuming the text-only assistant masking path works automatically. This is especially important because VLM processors and chat templates may go through a different path from ordinary text tokenizers.
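The core of such a collator is the masking step. The sketch below shows it on plain lists for a single sequence. It is not TRL's or Unsloth's implementation: `mask_labels` and `find_last_subsequence` are hypothetical helpers, and the token IDs in the usage example are invented. A real collator would apply the same logic per row on tensors, using the IDs your processor actually produces for the assistant-start marker, pad, and image tokens.

```python
IGNORE_INDEX = -100


def find_last_subsequence(sequence, pattern):
    """Index of the last occurrence of pattern in sequence, or -1 if absent."""
    n, m = len(sequence), len(pattern)
    for start in range(n - m, -1, -1):
        if sequence[start:start + m] == pattern:
            return start
    return -1


def mask_labels(input_ids, assistant_start_ids, never_supervise_ids):
    """Copy input_ids into labels, keeping only the assistant answer supervised."""
    start = find_last_subsequence(input_ids, assistant_start_ids)
    if start < 0:
        # Marker missing (for example truncated away): supervise nothing rather than everything.
        return [IGNORE_INDEX] * len(input_ids)
    answer_begin = start + len(assistant_start_ids)
    labels = []
    for position, token_id in enumerate(input_ids):
        if position < answer_begin or token_id in never_supervise_ids:
            labels.append(IGNORE_INDEX)
        else:
            labels.append(token_id)
    return labels


# Invented IDs: 7, 8 = assistant-start marker, 99 = image token, 0 = pad, 2 = end of turn.
ids = [1, 50, 51, 99, 7, 8, 60, 61, 2, 0, 0]
print(mask_labels(ids, assistant_start_ids=[7, 8], never_supervise_ids={0, 99}))
```

One caveat with this approach: if the pad token ID is the same as the EOS ID, masking by ID also removes the terminating EOS from the supervised answer, and the model may not learn to stop. In that case, masking padding via the attention mask is the safer route.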

Related resources:

  • TRL SFTTrainer docs — VLM support
  • TRL VLM full-sequence loss issue
  • HF Forum — assistant_only_loss=True and VLM/processor path confusion

Check 3: compare training vs inference chat rendering

A second high-probability failure mode is that training and inference do not render the same chat contract.

Useful references:

  • Transformers docs — Chat templates
  • Transformers docs — add_generation_prompt vs continue_final_message
  • Unsloth Gemma 4 Fine-tuning Guide
  • Unsloth chat templates docs
  • Google Gemma vision fine-tuning with Hugging Face

Print the exact rendered training string and inference string:

train_text = processor.apply_chat_template(
    train_messages,
    tokenize=False,
    add_generation_prompt=False,
)

infer_text = processor.apply_chat_template(
    infer_messages_without_assistant,
    tokenize=False,
    add_generation_prompt=True,
)

print("===== TRAIN RENDERED =====")
print(train_text)
print("===== INFER RENDERED =====")
print(infer_text)

Check:

  • same system prompt;
  • same user instruction;
  • same role markers;
  • image placeholder appears in the same position;
  • multimodal content order is consistent, usually image before text for Gemma-style multimodal prompts;
  • no duplicated BOS/EOS;
  • inference contains the correct assistant-start marker;
  • training does not accidentally include a generation prompt before the gold answer;
  • exported runtime uses the same chat template and EOS token.

This matters because chat models do not directly consume abstract Python dictionaries like:

{"role": "user", "content": "..."}

They consume a rendered token sequence. If the rendered sequence differs, the model may be seeing a different task.
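Eyeballing two long rendered strings is error-prone, so a prefix check can help. The sketch below assumes that, for your template, the inference rendering (with the generation prompt) should be an exact prefix of the training rendering (which continues with the gold answer). That holds for many chat templates but is an assumption here, and templates that insert extra blocks may differ legitimately. The helper names and the `<user>` / `<assistant>` markers are placeholders, not real Gemma tokens.

```python
def first_divergence(train_text, infer_text):
    """Return the index where infer_text stops being a prefix of train_text, or None."""
    if train_text.startswith(infer_text):
        return None
    limit = min(len(train_text), len(infer_text))
    for i in range(limit):
        if train_text[i] != infer_text[i]:
            return i
    return limit  # one string ended before any character differed


def show_divergence(train_text, infer_text, context=40):
    """Human-readable summary of where the two renderings split."""
    idx = first_divergence(train_text, infer_text)
    if idx is None:
        return "OK: inference prompt is an exact prefix of the training text"
    lo = max(0, idx - context)
    return (
        f"diverges at char {idx}\n"
        f"train: {train_text[lo:idx + context]!r}\n"
        f"infer: {infer_text[lo:idx + context]!r}"
    )


# Placeholder strings standing in for train_text / infer_text.
print(show_divergence("<user>hi<assistant>unsafe<end>", "<user>hi <assistant>"))
```

Printing with `!r` makes stray spaces, newlines, and duplicated special tokens visible, which are the differences most easily missed in a plain `print`.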

Check 4: do a nonce overfit to verify adapter/checkpoint/export

If this is LoRA/QLoRA, the fine-tuned behavior lives in the adapter unless it is correctly merged/exported.

Do a tiny debug run:

  1. create 4 examples;
  2. add one impossible target label;
  3. train briefly;
  4. run inference on the exact same example.

Example target:

[{"id":"0","label":"DEBUG_TOKEN_7F3A"}]

Interpretation:

| Result | Meaning |
|---|---|
| Model emits DEBUG_TOKEN_7F3A in the same training environment | Adapter and training path probably work |
| Model cannot emit the nonce even on the training sample | Adapter, labels, template, or training loop is suspect |
| Studio/in-training inference emits nonce, exported model does not | Export/runtime/template/EOS issue |
| Base and LoRA outputs are almost identical | Adapter may not be loaded or active |
| Merged model differs from base+adapter | Merge/export path may be wrong |
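If you want to script this test, the sketch below generates a nonce label and maps the observed outputs to the interpretations in the table above. It covers only the first four rows, and the function names and the exact-string comparison between base and adapter outputs are simplifying assumptions for illustration.

```python
import secrets


def make_nonce_label(prefix="DEBUG_TOKEN_"):
    """Build a label the base model is very unlikely to emit by chance."""
    return prefix + secrets.token_hex(2).upper()  # 4 random hex characters


def diagnose_nonce(nonce, adapter_output, base_output, exported_output=None):
    """Map nonce-test outputs to a likely cause."""
    if adapter_output == base_output:
        return "adapter may not be loaded or active"
    if nonce not in adapter_output:
        return "adapter, labels, template, or training loop is suspect"
    if exported_output is not None and nonce not in exported_output:
        return "export/runtime/template/EOS issue"
    return "adapter and training path probably work"


nonce = make_nonce_label()
print(nonce)
print(diagnose_nonce(nonce, f'[{{"id":"0","label":"{nonce}"}}]', '[{"id":"0","label":"unsafe"}]'))
```

All three outputs should come from the exact same training example and the same rendered prompt, otherwise the comparison says little about the adapter.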

Useful references:

  • TRL SFTTrainer docs — PEFT integration
  • Google Gemma QLoRA guide
  • Unsloth Gemma 4 Fine-tuning Guide
  • Medium — Fine-tuning Gemma 4 E2B step-by-step with Unsloth

Check 5: decode generated tokens only

For evaluation, do not decode prompt + generation together.

Use generated-only decoding:

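# Greedy decoding with a small budget: the target is only a short JSON string.
outputs = model.generate(
    **inputs,
    max_new_tokens=32,
    do_sample=False,
)

# For decoder-only models, generate() returns prompt + new tokens,
# so slice off the prompt before decoding.
prompt_len = inputs["input_ids"].shape[1]
generated_ids = outputs[:, prompt_len:]

text = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)[0]

print(text)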

Then separate:

  • JSON parse success,
  • field/key extraction success,
  • label extraction success,
  • class distribution,
  • final accuracy.

Do not default parse failures to unsafe.

Bad:

if parse_failed:
    pred = "unsafe"

Better:

if parse_failed:
    pred = "PARSE_FAIL"
elif label not in {"safe", "unsafe"}:
    pred = "INVALID_LABEL"
else:
    pred = label

Suggested report:

strict_json_parse_rate
label_extraction_rate
parse_fail_count
invalid_label_count
safe_count
unsafe_count
accuracy_on_parseable_outputs
overall_accuracy

A strong unsafe skew can be caused by model bias, but it can also be caused by parse-failure fallback.
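A minimal sketch of an evaluator that produces the report fields listed above while keeping parse failures separate from predictions. It assumes the target shape `[{"id": "0", "label": "..."}]` used in this thread. The function names are illustrative, and treating a parseable value of the wrong shape as `PARSE_FAIL` is a choice made for this example.

```python
import json
from collections import Counter

VALID_LABELS = {"safe", "unsafe"}


def classify_output(text):
    """Map raw generated text to a label, PARSE_FAIL, or INVALID_LABEL."""
    try:
        parsed = json.loads(text.strip())
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return "PARSE_FAIL"
    if not (isinstance(parsed, list) and parsed and isinstance(parsed[0], dict)):
        return "PARSE_FAIL"  # valid JSON, but not the expected list-of-objects shape
    label = parsed[0].get("label")
    if isinstance(label, str) and label in VALID_LABELS:
        return label
    return "INVALID_LABEL"


def evaluation_report(generated_texts, gold_labels):
    """Compute parse metrics and classification metrics separately."""
    preds = [classify_output(t) for t in generated_texts]
    counts = Counter(preds)
    total = len(preds)
    parseable = [(p, g) for p, g in zip(preds, gold_labels) if p in VALID_LABELS]
    correct = sum(p == g for p, g in parseable)
    return {
        "strict_json_parse_rate": (total - counts["PARSE_FAIL"]) / total if total else 0.0,
        "label_extraction_rate": len(parseable) / total if total else 0.0,
        "parse_fail_count": counts["PARSE_FAIL"],
        "invalid_label_count": counts["INVALID_LABEL"],
        "safe_count": counts["safe"],
        "unsafe_count": counts["unsafe"],
        "accuracy_on_parseable_outputs": correct / len(parseable) if parseable else 0.0,
        "overall_accuracy": correct / total if total else 0.0,
    }
```

Feed it the generated-only decoded strings, not prompt plus generation. If `parse_fail_count` is large while `accuracy_on_parseable_outputs` is high, the problem is more likely the output contract than the classification itself.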

Related resources:

  • HF Forum — VLM structured JSON/domain fine-tuning discussion
  • AWS — Fine-tune VLMs for multipage document-to-JSON
  • AWS sample repo — multimodal document-to-JSON with SageMaker AI

Check 6: temporarily remove JSON

Before debugging visual reasoning and JSON formatting at the same time, simplify the target:

unsafe

or:

safe

Tiny-overfit test:

| Test | Meaning |
|---|---|
| 4 examples, target only safe/unsafe, train-set accuracy near 100% | Basic adapter + visual/task path works |
| safe/unsafe works, JSON fails | JSON formatting/parser/decode contract is the issue |
| safe/unsafe also fails on 4 examples | Objective, adapter, template, or image input is still broken |
| JSON parse fails but label appears in raw text | Parser/evaluator is too strict |
| Label is never generated | Training target or inference prompt likely wrong |

Once this passes, reintroduce JSON:

[{"id":"0","label":"unsafe"}]

If JSON must be stable, consider prefill:

[{"id":"0","label":"

Then generate only the label continuation. In Transformers terminology, this is closer to continuing the final assistant message than starting a new assistant message, so be careful with add_generation_prompt vs continue_final_message.
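One practical consequence of prefill: with generated-only decoding, the decoded text no longer contains the prefilled prefix, so the evaluator has to re-attach it before parsing. A minimal sketch of that step follows. `label_from_continuation` is a hypothetical helper, and falling back to the text before the first closing quote is an assumption of this example, not a library behavior.

```python
import json

PREFILL = '[{"id":"0","label":"'


def label_from_continuation(continuation, prefill=PREFILL):
    """Re-attach the prefill to the generated continuation and extract the label."""
    candidate = prefill + continuation.strip()
    try:
        return json.loads(candidate)[0]["label"]
    except (ValueError, LookupError, TypeError):
        # The continuation was cut off or ran on: take the text before the first closing quote.
        head = continuation.split('"', 1)[0].strip()
        return head or None


print(label_from_continuation('unsafe"}]'))
print(label_from_continuation("safe"))
print(label_from_continuation('unsafe"}]\nExplanation: the ladder is'))
```

The extracted string still needs the same safe/unsafe validation as any other prediction. The fallback only recovers the text, it does not make it a valid label.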

Reference:

  • Transformers chat templates — generation prompts and continuing final messages

Check 7: use constrained or low-entropy decoding for classification

For debugging, use deterministic decoding:

outputs = model.generate(
    **inputs,
    max_new_tokens=16,
    do_sample=False,
)

For a binary task, you can also compare label token scores instead of free-form generation:

# Conceptual sketch:
# Prompt ends with: [{"id":"0","label":"
# Compare next-token / next-string probability for "safe" vs "unsafe"

This removes:

  • sampling noise,
  • malformed JSON,
  • explanation text,
  • markdown fences,
  • run-on generation.

If logit comparison works but full JSON generation fails, the classification signal may be present but the output contract is unstable.
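A sketch of the comparison step, assuming you have already obtained per-token log-probabilities for each candidate label given the prefilled prompt (for example from one forward pass per candidate). Obtaining them is model-specific and not shown here. The numbers below are invented and `rank_labels` is an illustrative name. Summing over all tokens matters because the two labels may tokenize into different numbers of pieces, so comparing only the first token can mislead.

```python
import math


def rank_labels(candidate_logprobs):
    """Sum per-token log-probs for each label and normalize across the candidates."""
    totals = {label: sum(lps) for label, lps in candidate_logprobs.items()}
    peak = max(totals.values())
    # Subtract the maximum before exponentiating for numerical stability.
    weights = {label: math.exp(total - peak) for label, total in totals.items()}
    norm = sum(weights.values())
    probs = {label: weight / norm for label, weight in weights.items()}
    best = max(probs, key=probs.get)
    return best, probs


# Invented log-probs: "safe" as one token, "unsafe" as two tokens.
best, probs = rank_labels({"safe": [-0.2], "unsafe": [-2.0, -0.1]})
print(best, probs)
```

The normalized values are relative scores between the two candidates only, not calibrated probabilities. They are still useful for checking whether the decision flips between images that should differ.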

Check 8: only then investigate image/bbox design

Once tiny overfit, adapter loading, label masking, template rendering, and evaluation are proven correct, then test the visual side.

Compare:

  1. full image only;
  2. ladder crop only;
  3. full image + ladder crop;
  4. full image with bbox drawn;
  5. different resolutions / visual token budgets;
  6. frozen vision layers vs vision LoRA;
  7. language-only LoRA vs vision+language LoRA.

For a safety/bbox task, raw coordinate text may be less effective than giving the model either a crop or a visible marked region.

Useful references:

  • Unsloth vision fine-tuning docs
  • Unsloth Gemma 4 Fine-tuning Guide
  • Google Gemma image understanding docs
  • Google Gemma vision fine-tuning with Hugging Face

Related examples where the general method works

This does not prove the exact stepladder dataset should work immediately, but it shows that the overall approach is valid when the data contract and evaluation contract are correct.

Structured image-to-JSON VLM fine-tuning

AWS has a document-to-JSON VLM fine-tuning example and sample repo. Their repo reports that smaller models such as Qwen2.5-VL 3B can achieve high exact extraction accuracy on a document-to-JSON task after fine-tuning.

  • AWS blog — Fine-tune VLMs for multipage document-to-JSON with SageMaker AI and SWIFT
  • AWS sample repo — sample-for-multi-modal-document-to-json-with-sagemaker-ai

This is conceptually close to:

image -> structured JSON

Your task is:

worksite image -> structured JSON label

So I would not conclude that “VLMs cannot do this”. I would first suspect the pipeline.

VLM SFT with TRL

There are multiple public VLM SFT recipes using TRL:

  • Hugging Face Cookbook — Fine-Tuning a Vision Language Model Qwen2-VL-7B with TRL
  • Hugging Face Cookbook — Fine-tuning SmolVLM with TRL
  • Phil Schmid — Fine-tune multimodal LLMs / VLMs with TRL
  • Daniel van Strien — Fine-tuning VLMs for Art History with TRL and HF Jobs
  • AMD ROCm tutorial — Fine-tuning Qwen2-VL-7B on ChartQA with LoRA

These examples are useful because they establish a baseline: VLM SFT itself is a normal workflow. If a model cannot overfit even 4 training examples, that is usually a contract/debug issue, not a reason to start with large hyperparameter sweeps.

Minimal debug sequence I would run

Phase A — freeze evidence

Record versions and runtime:

import torch, transformers, trl, peft

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("trl:", trl.__version__)
print("peft:", peft.__version__)

try:
    import unsloth
    print("unsloth:", getattr(unsloth, "__version__", "unknown"))
except Exception as e:
    print("unsloth import error:", repr(e))

Also record:

base model revision
adapter checkpoint path
export format
processor/tokenizer path
chat template
EOS token
PAD token
image processor settings
max_seq_length
max_new_tokens
do_sample

Phase B — 4-example overfit

Train on 4 examples.

Use one target like:

[{"id":"0","label":"DEBUG_TOKEN_7F3A"}]

Expected: exact training examples should be reproduced.

If this fails, stop and inspect adapter/template/labels.

Phase C — inspect batch labels

batch = next(iter(trainer.get_train_dataloader()))

input_ids = batch["input_ids"][0]
labels = batch["labels"][0]
mask = labels != -100

print(tokenizer.decode(input_ids[mask], skip_special_tokens=False))

Expected: only assistant answer.

If not, fix collator/objective.

Phase D — compare rendered templates

train_text = processor.apply_chat_template(
    train_messages,
    tokenize=False,
    add_generation_prompt=False,
)

infer_text = processor.apply_chat_template(
    infer_messages_without_assistant,
    tokenize=False,
    add_generation_prompt=True,
)

print(train_text)
print(infer_text)

Expected: same task prefix, correct assistant generation start.

Phase E — generated-only evaluation

outputs = model.generate(
    **inputs,
    max_new_tokens=32,
    do_sample=False,
)

prompt_len = inputs["input_ids"].shape[1]
generated_ids = outputs[:, prompt_len:]
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

Then report parse metrics separately from classification metrics.

Phase F — simplify output

First train:

unsafe

not:

[{"id":"0","label":"unsafe"}]

Once binary output works, reintroduce JSON.

Phase G — visual ablations

Only after A-F pass:

full image
crop image
full + crop
full with drawn bbox
vision frozen
vision LoRA
language-only LoRA

Practical fixes depending on what fails

| Failed check | Likely fix |
|---|---|
| labels != -100 includes prompt/user text | Use prompt-completion dataset, assistant-only/completion-only loss, or custom VLM collator |
| labels != -100 is empty or missing answer | Increase max length, fix truncation, check chat template generation mask |
| Base and LoRA outputs identical | Verify adapter loading, checkpoint path, active adapter, merge/export |
| Training render and inference render differ | Use same processor/tokenizer/chat template/EOS; fix add_generation_prompt/prefill semantics |
| Generated text is valid-ish but parser fails | Make parser tolerant or use constrained decoding/prefill |
| Parse failures counted as unsafe | Add PARSE_FAIL class in evaluation |
| safe/unsafe works but JSON fails | Keep classification simple, then add JSON prefill or constrained decoding |
| Tiny overfit works, full train still poor | Then inspect class imbalance, ambiguous labels, bbox/crop, resolution, and dataset quality |

Things I would not change first

I would not start with:

  • more epochs,
  • larger LoRA rank,
  • more data,
  • more complex system prompt,
  • vision layers everywhere,
  • higher resolution,
  • LR sweeps,
  • bigger model,

until these four statements are true:

  1. The adapter is active during inference.
  2. labels != -100 decodes only to the assistant answer.
  3. Training and inference render the same chat task.
  4. Evaluation decodes only generated tokens and does not default parse failures to unsafe.

If any of those is false, hyperparameter tuning can make the logs look different without fixing the underlying contract.

My likely diagnosis

My strongest hypothesis is:

The training loss is low because the model is optimizing an easier token objective than the intended stepladder safety decision, or because inference is not using the same adapter/template/EOS contract as training.

The closest public issue is the TRL VLM full-sequence-loss discussion:

  • huggingface/trl#3751

The most relevant official docs are:

  • TRL SFTTrainer
  • Transformers chat templates
  • Unsloth Gemma 4 Fine-tuning Guide

The shortest reliable path is:

4-example nonce overfit
-> inspect labels != -100
-> compare rendered chat templates
-> generated-only decode
-> binary safe/unsafe target
-> JSON target
-> bbox/crop/vision ablations

If the model cannot pass the 4-example nonce overfit with correct assistant-only labels, I would not consider the original accuracy number meaningful yet.
