
Fine-tuned our first 2B medical VLM on a single MacBook M4; it beats Google's MedGemma 4B on the MedXpertQA-MM eval dataset

Hugging Face Forums [Unofficial] May 16, 2026

Hey everyone,

I’ve been experimenting with fine-tuning smaller vision-language models (VLMs) for specialized domains. I wanted to see how far I could push a 2B-parameter model on medical reasoning tasks using targeted data.

Model: madrisight/MadriMed-VL-2B · Hugging Face

Benchmark performance vs. google/medgemma-4b-it (MadriMed-VL-2B score first, MedGemma score in parentheses):

MedXpertQA-MM: 21.05 (vs. 18.8). Outperforms a model twice its size on complex expert-level tasks!

SLAKE: 65.7 (vs. 72.3)

VQA-RAD: 43.09 (vs. 49.9)
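For a quick side-by-side of the numbers above, here's a small sketch that computes the per-benchmark gap in points. The scores are copied from this post; the function and variable names are just mine for illustration, and this is not taken from the evaluation notebook.

```python
# Scores copied from the post: (MadriMed-VL-2B, medgemma-4b-it).
SCORES = {
    "MedXpertQA-MM": (21.05, 18.8),
    "SLAKE": (65.7, 72.3),
    "VQA-RAD": (43.09, 49.9),
}


def score_gaps(scores):
    """Return the gap in points (ours minus baseline) for each benchmark."""
    return {
        name: round(ours - baseline, 2)
        for name, (ours, baseline) in scores.items()
    }


if __name__ == "__main__":
    for name, gap in score_gaps(SCORES).items():
        # A positive gap means the 2B model is ahead; negative means it trails.
        status = "ahead" if gap > 0 else "behind"
        print(f"{name}: {gap:+.2f} points ({status})")
```

The gaps are in raw points, so they say nothing about statistical significance. That would need the per-dataset sample counts, which I haven't listed here.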

I'm also sharing the evaluation notebook: https://github.com/krrish-v/all_huggingface/blob/main/model_evaluation/MadriMed-VL-2B_evaluation.ipynb

Happy to discuss methodology. I'm especially curious whether anyone has thoughts on why the MedXpertQA-MM result held up despite the SLAKE and VQA-RAD gaps.
