A Spectrum Evaluation Benchmark for Medical Multi-Modal Large Language Models
This work addresses the inadequate evaluation of Med-MLLMs in healthcare. Sound evaluation is crucial for their safe clinical deployment, though the contribution is incremental in that it focuses on benchmarking rather than model development.
The authors tackled the lack of suitable benchmarks for evaluating Medical Multi-Modal Large Language Models (Med-MLLMs) by introducing Asclepius, a novel benchmark that assesses these models across 15 medical specialties and various diagnostic capacities, comparing 6 Med-MLLMs with 3 human specialists to reveal their competencies and limitations.
The significant breakthroughs of Medical Multi-Modal Large Language Models (Med-MLLMs) are transforming modern healthcare with robust information synthesis and medical decision support. However, these models are often evaluated on benchmarks that are ill-suited to them, because such benchmarks do not capture the complexity of real-world diagnostics across diverse specialties. To address this gap, we introduce Asclepius, a novel Med-MLLM benchmark that comprehensively assesses Med-MLLMs along two dimensions: distinct medical specialties (cardiovascular, gastroenterology, etc.) and different diagnostic capacities (perception, disease analysis, etc.). Grounded in 3 proposed core principles, Asclepius ensures a comprehensive evaluation by encompassing 15 medical specialties, stratifying clinical tasks into 3 main categories and 8 sub-categories, and avoiding overlap with existing VQA datasets. We further provide an in-depth analysis of 6 Med-MLLMs and compare them with 3 human specialists, providing insights into their competencies and limitations in various medical contexts. Our work not only advances the understanding of Med-MLLMs' capabilities but also sets a precedent for future evaluations and the safe deployment of these models in clinical environments.
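To make the two evaluation axes concrete, the following is a minimal sketch of how results could be stratified by medical specialty and by diagnostic capacity. It is not the authors' evaluation code: the `Item` record, its field names, the `stratified_accuracy` helper, and the exact-match scoring rule are all assumptions introduced for illustration, and the toy records are made up rather than drawn from Asclepius.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One benchmark question; all field names here are hypothetical."""
    specialty: str   # e.g. "cardiovascular"
    capacity: str    # e.g. "perception"
    answer: str      # reference answer
    prediction: str  # answer given by a model or a human specialist


def stratified_accuracy(items, key):
    """Return {group: accuracy}, where each item's group is key(item)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        group = key(item)
        total[group] += 1
        # Exact-match scoring is an assumption; the source does not state the metric.
        is_match = item.prediction.strip().lower() == item.answer.strip().lower()
        correct[group] += int(is_match)
    return {group: correct[group] / total[group] for group in total}


# Toy records for illustration only; they are not taken from the benchmark.
toy_items = [
    Item("cardiovascular", "perception", "A", "A"),
    Item("cardiovascular", "disease analysis", "B", "C"),
    Item("gastroenterology", "perception", "C", "C"),
    Item("gastroenterology", "perception", "D", "A"),
]

by_specialty = stratified_accuracy(toy_items, key=lambda it: it.specialty)
by_capacity = stratified_accuracy(toy_items, key=lambda it: it.capacity)
print(by_specialty)  # {'cardiovascular': 0.5, 'gastroenterology': 0.5}
print(by_capacity)
```

Running the same helper over the predictions of each model and each human specialist would give per-specialty and per-capacity scores that can be compared side by side, which is the kind of comparison the abstract describes.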