CV | Jul 15, 2025

How Far Have Medical Vision-Language Models Come? A Comprehensive Benchmarking Study

Che Liu, Jiazhen Pan, Weixiang Shen, Wenjia Bai, Daniel Rueckert, Rossella Arcucci

arXiv:2507.11200v26.21 citations | h-index: 30 | Has Code

Originality: Synthesis-oriented

AI Analysis

This work identifies critical gaps that keep medical AI from safe decision support and highlights the need for improved multimodal alignment and evaluation. Its contribution is incremental, since it builds on existing benchmarking efforts.

The study benchmarked open-source vision-language models on medical tasks. It found that large general-purpose models often match or surpass medical-specific ones in zero-shot transfer, that reasoning performance lags behind understanding, and that no model meets clinical reliability thresholds.

Vision-Language Models (VLMs) trained on web-scale corpora excel at natural image tasks and are increasingly repurposed for healthcare; however, their competence in medical tasks remains underexplored. We present a comprehensive evaluation of open-source general-purpose and medically specialised VLMs, ranging from 3B to 72B parameters, across eight benchmarks: MedXpert, OmniMedVQA, PMC-VQA, PathVQA, MMMU, SLAKE, and VQA-RAD. To observe model performance across different aspects, we first separate it into understanding and reasoning components. Three salient findings emerge. First, large general-purpose models already match or surpass medical-specific counterparts on several benchmarks, demonstrating strong zero-shot transfer from natural to medical images. Second, reasoning performance is consistently lower than understanding, highlighting a critical barrier to safe decision support. Third, performance varies widely across benchmarks, reflecting differences in task design, annotation quality, and knowledge demands. No model yet reaches the reliability threshold for clinical deployment, underscoring the need for stronger multimodal alignment and more rigorous, fine-grained evaluation protocols.
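
The split into understanding and reasoning components described above can be illustrated with a minimal sketch. It assumes each benchmark question has already been labelled with one of the two components and scored as correct or incorrect. The function name `accuracy_by_component`, the record layout, and the toy data are hypothetical and are not taken from the study or its code.

```python
from collections import defaultdict


def accuracy_by_component(records):
    """Return {component: accuracy} from (component, is_correct) pairs.

    Hypothetical helper: the study's actual scoring pipeline is not
    described on this page.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for component, is_correct in records:
        totals[component] += 1
        hits[component] += int(is_correct)
    return {component: hits[component] / totals[component] for component in totals}


# Toy, made-up outcomes for one model; these are NOT results from the study.
toy_records = [
    ("understanding", True),
    ("understanding", True),
    ("understanding", False),
    ("understanding", True),
    ("reasoning", True),
    ("reasoning", False),
    ("reasoning", False),
    ("reasoning", True),
]

scores = accuracy_by_component(toy_records)
# Difference between the two components for this toy model.
gap = scores["understanding"] - scores["reasoning"]
print(scores, gap)
```

Reporting the two accuracies separately, rather than a single pooled score, is what makes a gap between understanding and reasoning visible for a given model and benchmark.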
