Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiehjiggvrlh2jyscuiqrn6eiidlkczn7isunjtjdbkjq7ozc6qp4q",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mk2s2653zy72"
  },
  "path": "/t/medgemma-1-5-4b-useful/175445#post_2",
  "publishedAt": "2026-04-22T05:22:16.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "the Hugging Science Discord",
    "Google for Developers",
    "LM Studio",
    "Hugging Face",
    "OpenReview",
    "arXiv"
  ],
  "textContent": "Since this is a medical question, you might get a good answer if you ask on the Hugging Science Discord. Generally speaking:\n\n* * *\n\nHere is the plainest, most honest answer.\n\n## Bottom line\n\nFor the way you tested it, **MedGemma 1.5 4B is probably not useful enough**.\n\n  * **Skin mole from a casual image:** mostly **no**.\n  * **Blood test report screenshot:** also mostly **no**.\n  * **Chest X-rays:** **possibly yes**, but only if you use it in a much narrower, more structured way. (Google for Developers)\n\n\n\nThat does **not** mean the model is bad. It means your current workflow is a bad match for what the model is actually designed and evaluated for. Google describes MedGemma as a **developer foundation model** for healthcare applications, not a finished clinical assistant, and says it should be **validated, adapted, grounded, orchestrated, or fine-tuned** for the target use case. The model card also says its outputs are **not intended to directly inform diagnosis or treatment**, that it has been evaluated mainly on **single-image tasks**, that it is **not optimized for multi-turn use**, and that it is **more prompt-sensitive** than Gemma 3. (Google for Developers)\n\n## Why your tests went badly\n\n### 1. You used it like a finished product, but it is really a base model\n\nA good way to think about MedGemma is: it is closer to an **engine** than a **complete car**. It can be strong inside a workflow, but by itself it is not necessarily a reliable end-user medical app. Google explicitly says adaptation may involve **prompt engineering, grounding, agentic orchestration, or fine-tuning**, depending on the use case and required validation. 
(Google for Developers)\n\nThat matters because your tests asked it to do many things at once:\n\n  * understand an image\n  * read tiny text\n  * preserve layout\n  * compare numbers with ranges\n  * reason medically\n  * explain itself clearly\n  * stay consistent across follow-up turns\n\n\n\nThat is much harder than the benchmark-style tasks used in official evaluations. (Google for Developers)\n\n### 2. LM Studio may be part of the problem too\n\nThis is especially relevant for your blood-report screenshot test. LM Studio has had vision-image resizing issues, and its changelog shows that image input sizing has been changed and made configurable over time. It now exposes **Settings → Chat → Image Inputs → Image resize bounds**, and earlier reports described automatic downsizing that hurt OCR-like tasks and spatial accuracy. If a report image is resized too aggressively, tiny values, decimal points, units, and row boundaries can get harder for the model to read. (LM Studio)\n\nSo your failures may be coming from **both** sides:\n\n  * the model is not ideal for your chat-style workflow\n  * the host app may be degrading the input before inference\n\n\n\nThat combination is especially bad for screenshots and subtle medical images. (LM Studio)\n\n## Skin mole: useful or useless?\n\nFor **casual mole assessment from one image**, I would say: **mostly not useful**.\n\nWhy:\n\nMedGemma does have dermatology training data and it scores **73.5%** on Google’s internal US-DermMCQA benchmark, so it is not random in dermatology. But that benchmark is not the same thing as “reliably assess this one phone photo of my mole in a chat window.” (Hugging Face)\n\nIndependent dermatology research is a good match for what you saw. 
A recent paper focusing on dermatology and MedGemma-4B found a gap between strong medical vision encoders and weaker full VLM diagnostic behavior, and says these models can **over-rely on language priors**, producing plausible-sounding answers without grounding enough in the image itself. (OpenReview)\n\nThat is almost exactly your result: the model gives some descriptive medical-sounding remarks, then falls back to “ask a doctor.” That is not surprising. It is the behavior of a model that has some dermatology competence but not enough reliable image-grounded certainty to make the answer useful for your purpose. (OpenReview)\n\n### My judgment for the mole case\n\nFor your exact use case, treat it as:\n\n  * **bad for diagnosis-like assessment**\n  * **possibly okay for structured description**\n  * **possibly okay for tracking or comparing photos over time**\n  * **not something I would rely on for deciding whether a lesion is concerning**\n\n\n\nThat fits both the official limitations and the current dermatology literature. (Google for Developers)\n\n## Blood test report: useful or useless?\n\nFor **blood test screenshots in LM Studio chat**, I would say: **mostly useless in its current form**.\n\nWhy:\n\nThis task is really several tasks stacked together:\n\n  1. OCR or image reading\n  2. table/row parsing\n  3. linking each value to the correct analyte\n  4. linking the correct unit\n  5. linking the correct reference range\n  6. doing the numerical comparison\n  7. explaining the result\n\n\n\nIf any one of those steps fails, the final answer can be wrong while still sounding confident. That is exactly the pattern you described. (Google for Developers)\n\nNow, to be fair to the model, MedGemma 1.5 **does** show good document-understanding benchmark numbers. On Google’s published evaluations, it gets **91.0 macro F1** on one raw PDF-to-JSON lab dataset, **71.0** on another, and **85.0** on the Mendeley lab-report PNG benchmark. 
But notice the framing: these are **structured extraction** tasks, not free-form screenshot chat. The technical report explicitly describes this capability as converting report images/PDFs into **structured JSON**, and even shows a prompt template that says: “You are a Clinical Data Extraction Specialist… extract all lab tests into a JSON list.” (Hugging Face)\n\nThat distinction is crucial. Your workflow was:\n\n> show screenshot → ask for interpretation\n\nThe official evaluation workflow is closer to:\n\n> show document → extract structured fields → do downstream processing\n\nThose are not the same. (arXiv)\n\n### My judgment for the blood-report case\n\nFor your current workflow, yes: **close to useless**.\n\nNot because MedGemma has no document skill, but because:\n\n  * screenshot reading is fragile\n  * LM Studio image resizing may hurt OCR-like accuracy\n  * free-form reasoning hides extraction mistakes\n  * multi-turn correction is not a strength of this model\n\n\n\nSo I would not use it as “medical report reader and interpreter” in the form you tried. (LM Studio)\n\n## Is there a key to make it “look more carefully”?\n\nThere is **no magic prompt**.\n\nThere are only better **workflows**.\n\nThe model card says MedGemma is not optimized for multi-turn use and is more sensitive to prompting than Gemma 3. So asking “look again,” “are you sure,” and “justify that” does not reliably make it inspect the image better. It often just makes it produce a different answer. (Google for Developers)\n\nThe real keys are:\n\n### 1. Use single-turn prompts\n\nStart a fresh prompt for each task instead of trying to repair a weak answer in long chat. That follows directly from the model’s multi-turn limitation and prompt sensitivity. (Google for Developers)\n\n### 2. Split extraction from interpretation\n\nFor reports, do **not** ask it to read, compare, and explain in one pass. First ask it to extract structured fields. Then separately compare values to ranges. 
This aligns with the document-understanding setup described in the technical report. (arXiv)\n\n### 3. Use structured outputs, not prose\n\nFree-form prose hides errors. JSON exposes them.\n\nFor blood reports, ask for only:\n\n  * test name\n  * measured value\n  * unit\n  * lower range\n  * upper range\n  * exact source text\n\n\n\nThen do interpretation in a second pass. (arXiv)\n\n### 4. Lower randomness\n\nThe Hugging Face card says the generation config was updated on **January 23, 2026** to use **greedy decoding by default**. For extraction-like work, low temperature / deterministic decoding is the safer choice. It will not make the model smarter, but it reduces drift. (Hugging Face)\n\n### 5. Improve the input\n\nFor skin photos: better lighting, sharper focus, tighter crop.\nFor reports: use original PDF or pasted text when possible, not screenshots.\nFor X-rays: use the original radiology image or DICOM if possible, not a photo of a monitor.\nThis matters even more in LM Studio because image resize settings affect what the model actually receives. (LM Studio)\n\n## What can you do better to make use of the model?\n\nHere is the practical answer.\n\n### For skin images\n\nDo **not** ask:\n\n> “Assess this mole.”\n\nAsk instead:\n\n> “Describe only visible features. Do not diagnose. Return: image-quality issues, asymmetry yes/no/unclear, border irregularity yes/no/unclear, color variation yes/no/unclear, ulceration yes/no/unclear, and confidence.”\n\nWhy this is better:\n\n  * it makes the task narrower\n  * it reduces hallucinated certainty\n  * it plays closer to “structured observation” than diagnosis\n  * it is more aligned with the known weakness in dermatology grounding (OpenReview)\n\n\n\n### For blood reports\n\nDo this in two passes.\n\n**Pass 1**\n\n> Extract every analyte into JSON with name, value, unit, ref_low, ref_high, source_text_exact. If unclear, use null. 
Do not interpret.\n\n**Pass 2**\n\n> Using only that JSON, classify each analyte as low / in range / high. If ref_low or ref_high is missing, say uncertain.\n\nWhy this is better:\n\n  * it separates reading from reasoning\n  * it matches the official report-extraction framing\n  * it lets you spot row-matching mistakes before interpretation (arXiv)\n\n\n\n### For general use\n\nUse it for:\n\n  * structured extraction\n  * structured image description\n  * constrained yes/no or closed-form questions\n  * comparing one current image with one prior image\n\n\n\nDo **not** use it for:\n\n  * broad clinical judgment\n  * diagnosis from casual images\n  * screenshot-heavy OCR-plus-interpretation in one pass\n  * long corrective chats trying to force a better answer (Google for Developers)\n\n\n\n## Has someone tested it with X-rays?\n\nYes, and this is the strongest part of the story.\n\nOfficially, MedGemma 1.5 added support for:\n\n  * **longitudinal chest X-rays**\n  * **anatomical localization with bounding boxes**\n  * improved chest-X-ray interpretation tasks (Google for Developers)\n\n\n\nOn Google’s published imaging evaluations, MedGemma 1.5 4B scores:\n\n  * **89.5 macro F1** on MIMIC CXR top-5 classification\n  * **65.7 macro accuracy** on the MS-CXR-T longitudinal disease-progression task\n  * **38.0 IoU** on Chest ImaGenome anatomy bounding-box detection (Hugging Face)\n\n\n\nThere is also an independent benchmark, **ReXVQA**, for chest-X-ray visual question answering. That paper reports MedGemma as the top-performing model tested there, at **83.24% overall accuracy**, and reports **83.84%** in its human-comparison reader study versus **77.27%** for the best radiology resident in that sample. (arXiv)\n\nSo yes: **people have tested it with X-rays**, and this is the domain where the evidence looks most encouraging. 
(arXiv)\n\n## But there is an important X-ray caveat\n\n“Good on chest X-rays” does **not** mean “great radiology report writer.”\n\nThe model card shows that MedGemma 1.5 4B gets **27.2 RadGraph F1** on MIMIC-CXR report generation, compared with **30.3** for a **fine-tuned** MedGemma 1 4B chest-X-ray model. So the public evidence suggests it is more convincing for **classification, localization, and constrained X-ray QA** than for generating a perfect free-form report. (Hugging Face)\n\nThere is also a recent paper, **PSF-Med**, showing that medical VLMs can change their answers when the same chest-X-ray question is rephrased. That means even in X-rays, prompt wording still matters and there is no perfect “careful mode.” (arXiv)\n\n## My case-specific judgment\n\nIf I map all of this onto your exact situation:\n\n### Skin mole\n\nYour result was **expected**. The model is not useless in dermatology, but it is not dependable enough for the way you used it. For casual mole assessment from a single image, I would mostly stop using it. (OpenReview)\n\n### Blood test report\n\nYour result was also **expected**. This is the worst fit of the three because screenshot OCR, layout parsing, and medical interpretation are all mixed together. In LM Studio, with image resize effects in the loop, I would mostly stop using it this way. (LM Studio)\n\n### Chest X-rays\n\nThis is the one I would keep testing. Not as a general radiologist replacement, but as a **scoped X-ray assistant** for things like:\n\n  * “Is there a pleural effusion?”\n  * “Compare current vs prior image.”\n  * “Where is the support device?”\n  * “List visible findings only.”\nThat is much closer to the capability area where MedGemma 1.5 has the strongest support. 
(Hugging Face)\n\n\n\n## Simplest conclusion\n\nIf you want the shortest honest answer:\n\n  * **Mole photo:** mostly not worth it.\n  * **Blood report screenshot:** mostly not worth it.\n  * **X-rays:** worth trying, but only in a narrow, structured way.\n\n\n\nSo the answer is not “MedGemma 1.5 4B is useless.”\nIt is:\n\n> **For your current LM Studio chat workflow, it is badly matched to mole assessment and screenshot blood reports, but it may be genuinely useful for constrained chest-X-ray tasks or structured extraction workflows.** (Google for Developers)",
  "title": "Medgemma 1.5 4b, useful?"
}