{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreifwnbvdg5zuqwvdqxdwq3zd43srg6ynfxiosto7kyrsp75hi2fvn4",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mlyhjaw4wz32"
  },
  "path": "/t/fine-tune-our-first-2b-medical-vlm-on-a-single-macbook-m4-beats-googles-medgemma-4b-on-medxpertqa-mm-eval-dataset/176061#post_1",
  "publishedAt": "2026-05-16T16:21:46.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "madrisight/MadriMed-VL-2B · Hugging Face",
    "https://github.com/krrish-v/all_huggingface/blob/main/model_evaluation/MadriMed-VL-2B_evaluation.ipynb"
  ],
  "textContent": "Hey everyone,\n\nI’ve been experimenting with fine-tuning smaller Vision-Language Models for specialized domains. I wanted to see how far I could push a 2B parameter model on medical reasoning tasks using targeted data.\n\nModel: madrisight/MadriMed-VL-2B · Hugging Face\n\nThe benchmark performance vs. google/medgemma-4b-it\n\nMedExpertQA-MM: 21.05 (vs. 18.8) — Outperforming a model twice its size on complex expertise tasks!\n\nSlake: 65.7 (vs. 72.3)\n\nVQA RAD: 43.09 (vs. 49.9)\n\nIm also sharing the evaluation techniques: https://github.com/krrish-v/all_huggingface/blob/main/model_evaluation/MadriMed-VL-2B_evaluation.ipynb\n\nHappy to discuss methodology, especially curious if anyone has thoughts on why the MedXpertQA generalization held up despite the SLAKE/VQA-RAD gap.",
  "title": "Fine-tune our first 2B medical VLM on a single MacBook M4, beats Google's MedGemma 4B on MedXpertQA-MM eval dataset"
}