Fine-tuned our first 2B medical VLM on a single MacBook M4; it beats Google's MedGemma 4B on the MedXpertQA-MM eval dataset
Hey everyone,
I’ve been experimenting with fine-tuning smaller Vision-Language Models for specialized domains. I wanted to see how far I could push a 2B parameter model on medical reasoning tasks using targeted data.
Model: madrisight/MadriMed-VL-2B (on Hugging Face)
Benchmark performance vs. google/medgemma-4b-it:
MedXpertQA-MM: 21.05 (vs. 18.8), outperforming a model twice its size on complex, expert-level tasks!
SLAKE: 65.7 (vs. 72.3)
VQA-RAD: 43.09 (vs. 49.9)
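To make the gaps easier to compare, here is a minimal sketch that computes the absolute and relative difference for each benchmark. The scores are copied from the list above; the helper name `score_gaps` is just illustrative, and the sketch assumes all three numbers are on the same 0-100 scale.

```python
# Scores copied from the post: (MadriMed-VL-2B, google/medgemma-4b-it).
# Assumption: every score is reported on the same 0-100 scale.
scores = {
    "MedXpertQA-MM": (21.05, 18.8),
    "SLAKE": (65.7, 72.3),
    "VQA-RAD": (43.09, 49.9),
}


def score_gaps(table):
    """Return (absolute gap in points, relative gap in %) per benchmark."""
    gaps = {}
    for name, (ours, baseline) in table.items():
        diff = ours - baseline
        absolute = round(diff, 2)
        relative = round(100 * diff / baseline, 1)
        gaps[name] = (absolute, relative)
    return gaps


for name, (absolute, relative) in score_gaps(scores).items():
    print(f"{name}: {absolute:+.2f} points ({relative:+.1f}% relative)")
```

A positive value means the 2B model is ahead of the 4B baseline; a negative value means it is behind.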
I'm also sharing the evaluation notebook: https://github.com/krrish-v/all_huggingface/blob/main/model_evaluation/MadriMed-VL-2B_evaluation.ipynb
Happy to discuss methodology. I'm especially curious whether anyone has thoughts on why the MedXpertQA-MM generalization held up despite the SLAKE/VQA-RAD gap.