
VLM Fine tuning: Near-Zero Training Loss but Poor Inference Accuracy on Train Set (Gemma 4 E2B It)

Hugging Face Forums [Unofficial] May 26, 2026

There seem to be reports of similar cases.

I would debug this as a training/inference contract mismatch before treating it as a Gemma 4 E2B-it capability problem.

The combination of:

near-zero training loss,
poor inference even on training images,
strong prediction skew toward one class,
a very short target answer such as safe / unsafe or JSON,
VLM SFT through a high-level wrapper/UI,

is exactly the kind of pattern where the scalar loss may be telling you the model learned some tokens, but not necessarily the task decision you care about.

My current guess would be:

The effective training target, effective inference prompt, and evaluation parser are probably not the same task contract.

The most likely causes, in order:

Priority	Failure mode	Why it fits this symptom
1	Loss mask is wrong: loss is computed on the full rendered conversation, not only the assistant label/JSON	Long prompt/template tokens can drive loss near zero while the few `safe`/`unsafe` tokens remain poorly learned
2	Training and inference chat templates differ	VLM/chat models are sensitive to role markers, image placeholders, EOS, and assistant-start tokens
3	LoRA adapter/checkpoint/export is not actually used at inference	Training loss can be real, while inference accidentally uses base model behavior
4	Evaluation/parsing bug	Parse failures or prompt-echoes can be misread as `unsafe`, creating artificial class skew
5	Image/bbox/crop issue	Possible, but I would check this only after the tiny-overfit and masking tests pass
6	Gemma 4 / Unsloth / Transformers / TRL version issue	Possible, but less useful to assume before inspecting the actual batch labels and rendered prompt

Why I would not trust the near-zero loss yet

For this task, the assistant answer is tiny:

[{"id":"0","label":"unsafe"}]

But the rendered training sequence may contain:

system prompt,
user instruction,
image placeholder tokens,
bbox text,
formatting/control tokens,
assistant JSON answer.

If the trainer computes loss on the whole rendered sequence, the model can reduce loss mostly by learning deterministic prompt/template tokens. The actual classification decision may be only a few tokens out of the whole sequence.
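As a rough illustration of that dilution effect, here is a minimal arithmetic sketch. The token counts and per-token losses are made-up numbers, not measurements from any real run, and `mean_loss` is a hypothetical helper rather than anything from TRL.

```python
def mean_loss(per_token_losses, supervised_mask):
    """Average the per-token loss over the positions the mask marks as supervised."""
    picked = [loss for loss, keep in zip(per_token_losses, supervised_mask) if keep]
    return sum(picked) / len(picked)


# Hypothetical numbers: 400 prompt/template tokens that are almost perfectly
# predictable, followed by 10 answer tokens that are still poorly predicted.
prompt_losses = [0.01] * 400
answer_losses = [1.5] * 10
losses = prompt_losses + answer_losses

full_sequence_mask = [True] * 410                 # loss on everything
answer_only_mask = [False] * 400 + [True] * 10    # loss on the answer only

full_loss = mean_loss(losses, full_sequence_mask)
answer_loss = mean_loss(losses, answer_only_mask)

print(f"full-sequence loss: {full_loss:.3f}")
print(f"answer-only loss:   {answer_loss:.3f}")
```

With these assumed numbers the full-sequence average is below 0.05 while the answer tokens alone still sit at 1.5, which is the gap the logged scalar would hide.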

This is a known issue class in TRL/VLM SFT:

TRL SFTTrainer docs
TRL issue #3751 — VLM SFT example computes loss for the entire sequence, including prompt/user content
TRL issue #5471 — assistant_only_loss=True requires {% generation %} / {% endgeneration %} markers
TRL issue #3781 — assistant_only_loss=True silently ignored with use_liger_kernel=True
HF Forum — SFTTrainer loss function and formatting_func
HF Forum — SFTTrainer works but without result

The key TRL doc detail is that completion_only_loss and assistant_only_loss are separate from ordinary full-sequence language-modeling loss. For prompt-completion datasets, completion-only loss can supervise only the completion. For conversational assistant-only training, the chat template must be able to return assistant/generation masks.

So the first question is not “why is loss low?” but:

Which tokens actually have labels other than -100?

Check 1: inspect the real supervised tokens

This is the single most important test.

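# Pull one collated batch exactly as the trainer will see it.
batch = next(iter(trainer.get_train_dataloader()))

input_ids = batch["input_ids"][0]
labels = batch["labels"][0]

# Positions labelled -100 are ignored by the loss; everything else is supervised.
mask = labels != -100

print("input length:", input_ids.numel())
print("supervised token count:", mask.sum().item())
# `tokenizer` is the text tokenizer (for a VLM, usually processor.tokenizer).
print(tokenizer.decode(input_ids[mask], skip_special_tokens=False))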

The decoded output should contain little more than the assistant target:

[{"id":"0","label":"unsafe"}]

Bad output:

You are a safety vision model...
Inspect the stepladder...
Ladder bbox: ...
[{"id":"0","label":"unsafe"}]

Very bad output:

<image> <pad> <eos>

or almost no supervised tokens.

Interpretation:

What `labels != -100` decodes to	Interpretation
Only assistant JSON / only `safe` or `unsafe`	Loss target is probably OK
System/user prompt + assistant answer	Loss is probably diluted by prompt/template tokens
Image/pad/special tokens	Collator/token masking is likely wrong
Empty or almost empty	Truncation/template mask may be broken
Assistant answer but missing label token	Truncation or bad target formatting

If the supervised region is not the assistant answer only, I would not tune learning rate, epochs, rank, or vision layers yet. Fix the objective first.
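If you want to run that interpretation over many examples instead of reading decoded text by eye, a small classifier along these lines may help. It is only a sketch: `classify_supervised_text` and its category names are hypothetical, and it assumes the decoded answer matches the expected string exactly once any trailing control tokens you list in `allowed_extra` are removed (tokenizers can alter whitespace, so treat a mismatch as a prompt to look, not as proof).

```python
def classify_supervised_text(decoded, expected_answer, allowed_extra=()):
    """Map the decoded `labels != -100` region onto a coarse diagnosis."""
    text = decoded
    # Drop control tokens that are legitimately supervised, e.g. an end marker.
    for marker in allowed_extra:
        text = text.replace(marker, "")
    text = text.strip()

    if not text:
        return "empty"            # truncation or a broken mask
    if expected_answer not in text:
        return "answer_missing"   # supervising the wrong tokens entirely
    if text == expected_answer:
        return "answer_only"      # the target looks right
    return "diluted"              # answer plus prompt/template text


answer = '[{"id":"0","label":"unsafe"}]'
print(classify_supervised_text(answer + "<eos>", answer, allowed_extra=("<eos>",)))
print(classify_supervised_text("You are a safety vision model...\n" + answer, answer))
print(classify_supervised_text("<image> <pad> <eos>", answer))
```

Counting these categories across a few hundred batches is usually more informative than inspecting a single example.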

Check 2: verify assistant-only / completion-only masking

If you can use prompt-completion form, prefer making the split explicit:

example = {
    "prompt": [
        {
            "role": "system",
            "content": "Classify stepladder use as safe or unsafe. Output JSON only."
        },
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": "Inspect the ladder. Ladder bbox: [x1,y1,x2,y2]."}
            ],
        },
    ],
    "completion": [
        {
            "role": "assistant",
            "content": '[{"id":"0","label":"unsafe"}]'
        }
    ],
}

The final label contract should be:

Token region	Label
system prompt	`-100`
user text	`-100`
image tokens	`-100`
pad tokens	`-100`
assistant JSON / class label	token IDs

For VLMs, you may need a custom collator rather than assuming the text-only assistant masking path works automatically. This is especially important because VLM processors and chat templates may go through a different path from ordinary text tokenizers.
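The label contract in the table can be sketched as a plain function, independent of any trainer. This is an illustration under assumptions, not TRL's or Unsloth's collator: `build_labels`, `prompt_len`, and the token IDs in the example are hypothetical, and a real collator would compute `prompt_len` by tokenizing the prompt-only rendering with the same processor.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def build_labels(input_ids, prompt_len, pad_token_id, image_token_ids=()):
    """Copy input_ids into labels, masking prompt, pad, and image positions."""
    image_token_ids = set(image_token_ids)
    labels = []
    for position, token_id in enumerate(input_ids):
        masked = (
            position < prompt_len             # system + user + assistant-start
            or token_id == pad_token_id       # padding
            or token_id in image_token_ids    # image placeholder tokens
        )
        labels.append(IGNORE_INDEX if masked else token_id)
    return labels


# Toy sequence: 6 prompt tokens (two of them image placeholders, id 99),
# 4 answer tokens ending in an end-of-sequence id of 1, then 2 pad tokens.
input_ids = [2, 10, 11, 99, 99, 12, 50, 51, 52, 1, 0, 0]
labels = build_labels(input_ids, prompt_len=6, pad_token_id=0, image_token_ids=[99])
print(labels)
```

One caution with this approach: if the pad token and the end-of-sequence token share the same ID, masking by pad ID also masks the end marker in the answer, so the model may never be trained to stop.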

Related resources:

TRL SFTTrainer docs — VLM support
TRL VLM full-sequence loss issue
HF Forum — assistant_only_loss=True and VLM/processor path confusion

Check 3: compare training vs inference chat rendering

A second high-probability failure mode is that training and inference do not render the same chat contract.

Useful references:

Transformers docs — Chat templates
Transformers docs — add_generation_prompt vs continue_final_message
Unsloth Gemma 4 Fine-tuning Guide
Unsloth chat templates docs
Google Gemma vision fine-tuning with Hugging Face

Print the exact rendered training string and inference string:

train_text = processor.apply_chat_template(
    train_messages,
    tokenize=False,
    add_generation_prompt=False,
)

infer_text = processor.apply_chat_template(
    infer_messages_without_assistant,
    tokenize=False,
    add_generation_prompt=True,
)

print("===== TRAIN RENDERED =====")
print(train_text)
print("===== INFER RENDERED =====")
print(infer_text)

Check:

same system prompt;
same user instruction;
same role markers;
image placeholder appears in the same position;
multimodal content order is consistent, usually image before text for Gemma-style multimodal prompts;
no duplicated BOS/EOS;
inference contains the correct assistant-start marker;
training does not accidentally include a generation prompt before the gold answer;
exported runtime uses the same chat template and EOS token.

This matters because chat models do not directly consume abstract Python dictionaries like:

{"role": "user", "content": "..."}

They consume a rendered token sequence. If the rendered sequence differs, the model may be seeing a different task.
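One way to make that comparison mechanical is to check whether the inference rendering is an exact prefix of the training rendering. This is a sketch under an assumption: that the template renders a training example as "prompt + assistant start + answer", so the inference string should match its beginning character for character. Some templates may legitimately break this, so a failure means "look here", not "definitely wrong". `check_prefix_contract` is a hypothetical helper, and the `<user>`/`<assistant>` markers are toy strings, not Gemma's real control tokens.

```python
def check_prefix_contract(train_text, infer_text, context=30):
    """Report whether infer_text is an exact prefix of train_text."""
    if train_text.startswith(infer_text):
        # Whatever follows the prefix is what the model is trained to produce.
        return {"ok": True, "answer_region": train_text[len(infer_text):]}

    limit = min(len(train_text), len(infer_text))
    first_diff = next(
        (i for i in range(limit) if train_text[i] != infer_text[i]), limit
    )
    start = max(0, first_diff - context)
    return {
        "ok": False,
        "first_diff": first_diff,
        "train_context": train_text[start:first_diff + context],
        "infer_context": infer_text[start:first_diff + context],
    }


train_text = "<user>Inspect the ladder.</user><assistant>unsafe</assistant>"
good_infer = "<user>Inspect the ladder.</user><assistant>"
bad_infer = "<user>Inspect the stepladder.</user><assistant>"

print(check_prefix_contract(train_text, good_infer))
print(check_prefix_contract(train_text, bad_infer))
```

When the check passes, the `answer_region` is also worth reading: it should be the gold answer plus the end marker and nothing else.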

Check 4: do a nonce overfit to verify adapter/checkpoint/export

If this is LoRA/QLoRA, the fine-tuned behavior lives in the adapter unless it is correctly merged/exported.

Do a tiny debug run:

create 4 examples;
add one nonce target label that the base model would never produce on its own;
train briefly;
run inference on the exact same example.

Example target:

[{"id":"0","label":"DEBUG_TOKEN_7F3A"}]

Interpretation:

Result	Meaning
Model emits `DEBUG_TOKEN_7F3A` in the same training environment	Adapter and training path probably work
Model cannot emit the nonce even on the training sample	Adapter, labels, template, or training loop is suspect
Studio/in-training inference emits nonce, exported model does not	Export/runtime/template/EOS issue
Base and LoRA outputs are almost identical	Adapter may not be loaded or active
Merged model differs from base+adapter	Merge/export path may be wrong
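The table can be turned into a small triage function once you have the generated text from each setup. This is a sketch only: `interpret_nonce_run` and its finding names are hypothetical, and it assumes you have already decoded generated tokens only for the base model, the base model with the adapter, and (optionally) the exported model on the same training example.

```python
def interpret_nonce_run(nonce, base_text, adapter_text, exported_text=None):
    """Return coarse findings that mirror the rows of the table above."""
    findings = []

    if nonce in adapter_text:
        findings.append("training_path_ok")
    else:
        findings.append("training_path_suspect")

    # Identical outputs suggest the adapter is not loaded or not active.
    if base_text.strip() == adapter_text.strip():
        findings.append("adapter_maybe_inactive")

    # The nonce survives in training-time inference but is lost after export.
    if exported_text is not None and nonce in adapter_text and nonce not in exported_text:
        findings.append("export_suspect")

    return findings


nonce = "DEBUG_TOKEN_7F3A"
print(interpret_nonce_run(
    nonce,
    base_text='[{"id":"0","label":"unsafe"}]',
    adapter_text='[{"id":"0","label":"DEBUG_TOKEN_7F3A"}]',
    exported_text='[{"id":"0","label":"unsafe"}]',
))
```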

Useful references:

TRL SFTTrainer docs — PEFT integration
Google Gemma QLoRA guide
Unsloth Gemma 4 Fine-tuning Guide
Medium — Fine-tuning Gemma 4 E2B step-by-step with Unsloth

Check 5: decode generated tokens only

For evaluation, do not decode prompt + generation together.

Use generated-only decoding:

outputs = model.generate(
    **inputs,
    max_new_tokens=32,
    do_sample=False,
)

prompt_len = inputs["input_ids"].shape[1]
generated_ids = outputs[:, prompt_len:]

text = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)[0]

print(text)

Then separate:

JSON parse success,
field/key extraction success,
label extraction success,
class distribution,
final accuracy.

Do not default parse failures to unsafe.

Bad:

if parse_failed:
    pred = "unsafe"

Better:

if parse_failed:
    pred = "PARSE_FAIL"
elif label not in {"safe", "unsafe"}:
    pred = "INVALID_LABEL"
else:
    pred = label

Suggested report:

strict_json_parse_rate
label_extraction_rate
parse_fail_count
invalid_label_count
safe_count
unsafe_count
accuracy_on_parseable_outputs
overall_accuracy

A strong unsafe skew can be caused by model bias, but it can also be caused by parse-failure fallback.
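A minimal sketch of such a report, using only the standard library, could look like the following. The function names are hypothetical, it assumes the target shape `[{"id": ..., "label": ...}]` from this thread, and it deliberately counts valid JSON with the wrong shape as a parse failure, which you may prefer to track separately.

```python
import json
from collections import Counter

VALID_LABELS = {"safe", "unsafe"}


def parse_prediction(text):
    """Return 'safe', 'unsafe', 'INVALID_LABEL', or 'PARSE_FAIL' for one output."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return "PARSE_FAIL"
    if not (isinstance(data, list) and data and isinstance(data[0], dict)):
        return "PARSE_FAIL"
    label = data[0].get("label")
    if isinstance(label, str) and label in VALID_LABELS:
        return label
    return "INVALID_LABEL"


def build_report(generated_texts, gold_labels):
    """Keep parsing metrics and classification metrics apart."""
    preds = [parse_prediction(text) for text in generated_texts]
    counts = Counter(preds)
    total = len(preds)
    parseable = [(p, g) for p, g in zip(preds, gold_labels) if p in VALID_LABELS]
    correct = sum(p == g for p, g in parseable)
    return {
        "strict_json_parse_rate": (total - counts["PARSE_FAIL"]) / total,
        "label_extraction_rate": len(parseable) / total,
        "parse_fail_count": counts["PARSE_FAIL"],
        "invalid_label_count": counts["INVALID_LABEL"],
        "safe_count": counts["safe"],
        "unsafe_count": counts["unsafe"],
        "accuracy_on_parseable_outputs": correct / len(parseable) if parseable else None,
        "overall_accuracy": correct / total,
    }


texts = [
    '[{"id":"0","label":"unsafe"}]',
    '[{"id":"0","label":"safe"}]',
    "The ladder looks unsafe.",
    '[{"id":"0","label":"maybe"}]',
]
gold = ["unsafe", "unsafe", "unsafe", "safe"]
report = build_report(texts, gold)
for key, value in report.items():
    print(f"{key}: {value}")
```

Note that the third output mentions "unsafe" in free text but is still reported as a parse failure, not silently counted as an `unsafe` prediction.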

Related resources:

HF Forum — VLM structured JSON/domain fine-tuning discussion
AWS — Fine-tune VLMs for multipage document-to-JSON
AWS sample repo — multimodal document-to-JSON with SageMaker AI

Check 6: temporarily remove JSON

Before debugging visual reasoning and JSON formatting at the same time, simplify the target:

unsafe

or:

safe

Tiny-overfit test:

Test	Meaning
4 examples, target only `safe`/`unsafe`, train-set accuracy near 100%	Basic adapter + visual/task path works
`safe`/`unsafe` works, JSON fails	JSON formatting/parser/decode contract is the issue
`safe`/`unsafe` also fails on 4 examples	Objective, adapter, template, or image input is still broken
JSON parse fails but label appears in raw text	Parser/evaluator is too strict
Label is never generated	Training target or inference prompt likely wrong

Once this passes, reintroduce JSON:

[{"id":"0","label":"unsafe"}]

If JSON must be stable, consider prefill:

[{"id":"0","label":"

Then generate only the label continuation. In Transformers terminology, this is closer to continuing the final assistant message than starting a new assistant message, so be careful with add_generation_prompt vs continue_final_message.
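The bookkeeping around prefill can be sketched without any model code. Both helpers below are hypothetical: `build_prefill_messages` appends a partial assistant turn that you would then render so that the final message is continued, not closed, and `join_prefill` reattaches the prefill to whatever the model generated. It assumes the label contains no quote characters.

```python
import json

PREFILL = '[{"id":"0","label":"'


def build_prefill_messages(prompt_messages, prefill=PREFILL):
    """Append a partial assistant turn that generation is meant to continue."""
    return list(prompt_messages) + [{"role": "assistant", "content": prefill}]


def join_prefill(prefill, continuation, closing='"}]'):
    """Rebuild the full JSON from the prefill and the generated continuation."""
    # Keep only the text before the first closing quote; ignore any run-on text.
    label = continuation.split('"', 1)[0].strip()
    return json.loads(prefill + label + closing)


messages = build_prefill_messages(
    [{"role": "user", "content": "Inspect the ladder."}]
)
print(messages[-1])
print(join_prefill(PREFILL, 'unsafe"}]'))
print(join_prefill(PREFILL, 'safe"}]\nThe ladder is fully extended.'))
```

If you train with prefill at inference in mind, it is worth confirming that the rendered prefill tokenizes the same way as the start of the training answer, since a token boundary that falls differently is another form of contract mismatch.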

Reference:

Transformers chat templates — generation prompts and continuing final messages

Check 7: use constrained or low-entropy decoding for classification

For debugging, use deterministic decoding:

outputs = model.generate(
    **inputs,
    max_new_tokens=16,
    do_sample=False,
)

For a binary task, you can also compare label token scores instead of free-form generation:

# Conceptual sketch:
# Prompt ends with: [{"id":"0","label":"
# Compare next-token / next-string probability for "safe" vs "unsafe"

This removes:

sampling noise,
malformed JSON,
explanation text,
markdown fences,
run-on generation.

If logit comparison works but full JSON generation fails, the classification signal may be present but the output contract is unstable.
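The comparison step itself is simple once you have per-token log-probabilities for each candidate label given the same prompt. The sketch below assumes you have already obtained those from a forward pass (that part is omitted); `compare_labels` is a hypothetical helper and the numbers are invented. It sums log-probabilities per candidate, so labels that split into different numbers of tokens are compared as whole strings.

```python
import math


def compare_labels(candidate_token_logprobs):
    """Pick the label whose token sequence has the highest total log-probability.

    candidate_token_logprobs maps each label to the list of log-probabilities
    of its tokens, each conditioned on the prompt and the preceding label tokens.
    """
    scores = {label: sum(lps) for label, lps in candidate_token_logprobs.items()}

    # Normalize over the candidates only (softmax with max-subtraction for stability).
    top = max(scores.values())
    weights = {label: math.exp(score - top) for label, score in scores.items()}
    total = sum(weights.values())
    probs = {label: weight / total for label, weight in weights.items()}

    best = max(probs, key=probs.get)
    return best, probs


# Invented example: "safe" is one token, "unsafe" happens to split into two.
best, probs = compare_labels({"safe": [-2.0], "unsafe": [-0.5, -0.2]})
print(best, probs)
```

Whether to length-normalize the scores is a design choice; summing is the plain joint probability, while averaging would favor longer labels less harshly.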

Check 8: only then investigate image/bbox design

Once tiny overfit, adapter loading, label masking, template rendering, and evaluation are proven correct, then test the visual side.

Compare:

full image only;
ladder crop only;
full image + ladder crop;
full image with bbox drawn;
different resolutions / visual token budgets;
frozen vision layers vs vision LoRA;
language-only LoRA vs vision+language LoRA.

For a safety/bbox task, raw coordinate text may be less effective than giving the model either a crop or a visible marked region.
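For the crop variants, one detail that is easy to get wrong is cropping too tightly or outside the image. A minimal sketch of a padded, clamped crop box follows; `expand_bbox` and the 10% margin are assumptions for illustration, and the bbox is taken to be `(x1, y1, x2, y2)` in pixels.

```python
def expand_bbox(bbox, image_width, image_height, margin=0.15):
    """Grow a pixel bbox by a relative margin and clamp it to the image bounds."""
    x1, y1, x2, y2 = bbox
    pad_x = (x2 - x1) * margin
    pad_y = (y2 - y1) * margin
    return (
        max(0, int(round(x1 - pad_x))),
        max(0, int(round(y1 - pad_y))),
        min(image_width, int(round(x2 + pad_x))),
        min(image_height, int(round(y2 + pad_y))),
    )


# Ladder bbox near the bottom edge of a 640x480 frame.
crop_box = expand_bbox((100, 50, 300, 450), image_width=640, image_height=480, margin=0.1)
print(crop_box)
```

The resulting tuple is in the `(left, upper, right, lower)` order that Pillow's `Image.crop` expects, so it can be passed straight through if you load images with Pillow. Keeping some margin preserves context such as the floor or the worker's feet, which may matter for a safety judgment.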

Useful references:

Unsloth vision fine-tuning docs
Unsloth Gemma 4 Fine-tuning Guide
Google Gemma image understanding docs
Google Gemma vision fine-tuning with Hugging Face

Related examples where the general method works

This does not prove the exact stepladder dataset should work immediately, but it shows that the overall approach is valid when the data contract and evaluation contract are correct.

Structured image-to-JSON VLM fine-tuning

AWS has a document-to-JSON VLM fine-tuning example and sample repo. Their repo reports that smaller models such as Qwen2.5-VL 3B can achieve high exact extraction accuracy on a document-to-JSON task after fine-tuning.

AWS blog — Fine-tune VLMs for multipage document-to-JSON with SageMaker AI and SWIFT
AWS sample repo — sample-for-multi-modal-document-to-json-with-sagemaker-ai

This is conceptually close to:

image -> structured JSON

Your task is:

worksite image -> structured JSON label

So I would not conclude that “VLMs cannot do this”. I would first suspect the pipeline.

VLM SFT with TRL

There are multiple public VLM SFT recipes using TRL:

Hugging Face Cookbook — Fine-Tuning a Vision Language Model Qwen2-VL-7B with TRL
Hugging Face Cookbook — Fine-tuning SmolVLM with TRL
Phil Schmid — Fine-tune multimodal LLMs / VLMs with TRL
Daniel van Strien — Fine-tuning VLMs for Art History with TRL and HF Jobs
AMD ROCm tutorial — Fine-tuning Qwen2-VL-7B on ChartQA with LoRA

These examples are useful because they establish a baseline: VLM SFT itself is a normal workflow. If a model cannot overfit even 4 training examples, that is usually a contract/debug issue, not a reason to start with large hyperparameter sweeps.

Minimal debug sequence I would run

Phase A — freeze evidence

Record versions and runtime:

import torch, transformers, trl, peft

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("trl:", trl.__version__)
print("peft:", peft.__version__)

try:
    import unsloth
    print("unsloth:", getattr(unsloth, "__version__", "unknown"))
except Exception as e:
    print("unsloth import error:", repr(e))

Also record:

base model revision
adapter checkpoint path
export format
processor/tokenizer path
chat template
EOS token
PAD token
image processor settings
max_seq_length
max_new_tokens
do_sample

Phase B — 4-example overfit

Train on 4 examples.

Use one target like:

[{"id":"0","label":"DEBUG_TOKEN_7F3A"}]

Expected: exact training examples should be reproduced.

If this fails, stop and inspect adapter/template/labels.

Phase C — inspect batch labels

batch = next(iter(trainer.get_train_dataloader()))

input_ids = batch["input_ids"][0]
labels = batch["labels"][0]
mask = labels != -100

print(tokenizer.decode(input_ids[mask], skip_special_tokens=False))

Expected: only assistant answer.

If not, fix collator/objective.

Phase D — compare rendered templates

train_text = processor.apply_chat_template(
    train_messages,
    tokenize=False,
    add_generation_prompt=False,
)

infer_text = processor.apply_chat_template(
    infer_messages_without_assistant,
    tokenize=False,
    add_generation_prompt=True,
)

print(train_text)
print(infer_text)

Expected: same task prefix, correct assistant generation start.

Phase E — generated-only evaluation

outputs = model.generate(
    **inputs,
    max_new_tokens=32,
    do_sample=False,
)

prompt_len = inputs["input_ids"].shape[1]
generated_ids = outputs[:, prompt_len:]
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

Then report parse metrics separately from classification metrics.

Phase F — simplify output

First train:

unsafe

not:

[{"id":"0","label":"unsafe"}]

Once binary output works, reintroduce JSON.

Phase G — visual ablations

Only after A-F pass:

full image
crop image
full + crop
full with drawn bbox
vision frozen
vision LoRA
language-only LoRA

Practical fixes depending on what fails

Failed check	Likely fix
`labels != -100` includes prompt/user text	Use prompt-completion dataset, assistant-only/completion-only loss, or custom VLM collator
`labels != -100` is empty or missing answer	Increase max length, fix truncation, check chat template generation mask
Base and LoRA outputs identical	Verify adapter loading, checkpoint path, active adapter, merge/export
Training render and inference render differ	Use same processor/tokenizer/chat template/EOS; fix `add_generation_prompt`/prefill semantics
Generated text is valid-ish but parser fails	Make parser tolerant or use constrained decoding/prefill
Parse failures counted as `unsafe`	Add `PARSE_FAIL` class in evaluation
`safe`/`unsafe` works but JSON fails	Keep classification simple, then add JSON prefill or constrained decoding
Tiny overfit works, full train still poor	Then inspect class imbalance, ambiguous labels, bbox/crop, resolution, and dataset quality

Things I would not change first

I would not start with:

more epochs,
larger LoRA rank,
more data,
more complex system prompt,
vision layers everywhere,
higher resolution,
LR sweeps,
bigger model,

until these four statements are true:

The adapter is active during inference.
labels != -100 decodes only to the assistant answer.
Training and inference render the same chat task.
Evaluation decodes only generated tokens and does not default parse failures to unsafe.

If any of those are false, hyperparameter tuning can make the logs look different without fixing the underlying contract.

My likely diagnosis

My strongest hypothesis is:

The training loss is low because the model is optimizing an easier token objective than the intended stepladder safety decision, or because inference is not using the same adapter/template/EOS contract as training.

The closest public issue is the TRL VLM full-sequence-loss discussion:

huggingface/trl#3751

The most relevant official docs are:

TRL SFTTrainer
Transformers chat templates
Unsloth Gemma 4 Fine-tuning Guide

The shortest reliable path is:

4-example nonce overfit
-> inspect labels != -100
-> compare rendered chat templates
-> generated-only decode
-> binary safe/unsafe target
-> JSON target
-> bbox/crop/vision ablations

If the model cannot pass the 4-example nonce overfit with correct assistant-only labels, I would not consider the original accuracy number meaningful yet.
