{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreifwnbvdg5zuqwvdqxdwq3zd43srg6ynfxiosto7kyrsp75hi2fvn4",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mlyhjaw4wz32"
},
"path": "/t/fine-tune-our-first-2b-medical-vlm-on-a-single-macbook-m4-beats-googles-medgemma-4b-on-medxpertqa-mm-eval-dataset/176061#post_1",
"publishedAt": "2026-05-16T16:21:46.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"madrisight/MadriMed-VL-2B · Hugging Face",
"https://github.com/krrish-v/all_huggingface/blob/main/model_evaluation/MadriMed-VL-2B_evaluation.ipynb"
],
"textContent": "Hey everyone,\n\nI’ve been experimenting with fine-tuning smaller Vision-Language Models for specialized domains. I wanted to see how far I could push a 2B parameter model on medical reasoning tasks using targeted data.\n\nModel: madrisight/MadriMed-VL-2B · Hugging Face\n\nThe benchmark performance vs. google/medgemma-4b-it\n\nMedExpertQA-MM: 21.05 (vs. 18.8) — Outperforming a model twice its size on complex expertise tasks!\n\nSlake: 65.7 (vs. 72.3)\n\nVQA RAD: 43.09 (vs. 49.9)\n\nIm also sharing the evaluation techniques: https://github.com/krrish-v/all_huggingface/blob/main/model_evaluation/MadriMed-VL-2B_evaluation.ipynb\n\nHappy to discuss methodology, especially curious if anyone has thoughts on why the MedXpertQA generalization held up despite the SLAKE/VQA-RAD gap.",
"title": "Fine-tune our first 2B medical VLM on a single MacBook M4, beats Google's MedGemma 4B on MedXpertQA-MM eval dataset"
}