VLM Fine tuning: Near-Zero Training Loss but Poor Inference Accuracy on Train Set (Gemma 4 E2B It)
Hi everyone
I am currently fine-tuning the Gemma 4 E2B model for a worker safety project. My goal is to classify whether a worker is using a stepladder safely based on specific safety guidelines (e.g., step position, orientation, and ladder stability).
The Problem: I am facing a strange behavior: My Training Loss converges to near zero (~0.001) very quickly. However, when I run inference on the exact same training images to calculate metrics, the performance is extremely poor (Accuracy ~50%, with a heavy bias towards the “unsafe” class).
Dataset Format: I reformatted my dataset so the Assistant outputs a single JSON string. I also provide the bounding box of the ladder in the User prompt to focus the model’s attention.
{ “messages”: [ { “role”: “system”, “content”: “You are a safety vision model… [Detailed Safety Rules]… Output JSON only.” }, { “role”: “user”, “content”: [ {“type”: “image”, “image”: “<PIL.Image>”}, {“type”: “text”, “text”: “Inspect the stepladder…”} ] }, { “role”: “assistant”, “content”: [{“type”: “text”, “text”: “[{“id”: “0”, “label”: “unsafe”}]”}] } ] }
Framework & Environment:
Training Tool: Unsloth Studio (Web UI)
Base Model**:** Gemma-4 E2B it
PEFT Method**:** LoRA (Fine-tuning both Vision and Language adapters)
Has anyone encountered this “Zero Loss but Zero Performance” issue with Gemma VLM or similar models? Please help me now i am so stuck
Discussion in the ATmosphere