U2-BENCH: Benchmarking Large Vision-Language Models on Ultrasound Understanding
This work addresses the need for standardized evaluation of LVLMs in medical ultrasound, a critical healthcare domain, but is incremental as it applies existing benchmarking approaches to a new modality.
The authors tackle the problem of evaluating large vision-language models (LVLMs) on ultrasound understanding, which is challenging because of image quality issues and the lack of prior benchmarks. They introduce U2-BENCH, a comprehensive benchmark with 7,241 cases across 15 anatomical regions and 8 tasks. Their evaluation reveals strong performance on image-level classification but persistent challenges in spatial reasoning and clinical language generation.
Ultrasound is a widely used imaging modality critical to global healthcare, yet its interpretation remains challenging because image quality varies with the operator, noise, and anatomical structure. Although large vision-language models (LVLMs) have demonstrated impressive multimodal capabilities across natural and medical domains, their performance on ultrasound remains largely unexplored. We introduce U2-BENCH, the first comprehensive benchmark to evaluate LVLMs on ultrasound understanding across classification, detection, regression, and text generation tasks. U2-BENCH aggregates 7,241 cases spanning 15 anatomical regions and defines 8 clinically inspired tasks, such as diagnosis, view recognition, lesion localization, clinical value estimation, and report generation, across 50 ultrasound application scenarios. We evaluate 20 state-of-the-art LVLMs, both open- and closed-source, general-purpose and medical-specific. Our results reveal strong performance on image-level classification, but persistent challenges in spatial reasoning and clinical language generation. U2-BENCH establishes a rigorous and unified testbed to assess and accelerate LVLM research in the uniquely multimodal domain of medical ultrasound imaging.
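To make the task families concrete, here is a minimal sketch of how predictions might be scored for three of the four families the abstract names (classification, detection, regression). The metric choices (accuracy, box IoU, mean absolute error) and all function names are illustrative assumptions, not the metrics or code U2-BENCH is stated to use. Text generation is left out because it would need a language-level metric that the abstract does not specify.

```python
from typing import Sequence, Tuple

# Hypothetical box format for this sketch: (x1, y1, x2, y2) corner coordinates.
Box = Tuple[float, float, float, float]


def accuracy(preds: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of exact label matches (assumed classification metric)."""
    if len(preds) != len(labels) or not labels:
        raise ValueError("preds and labels must be non-empty and equal length")
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes (assumed localization metric)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_absolute_error(preds: Sequence[float], targets: Sequence[float]) -> float:
    """Average absolute deviation (assumed clinical value estimation metric)."""
    if len(preds) != len(targets) or not targets:
        raise ValueError("preds and targets must be non-empty and equal length")
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(targets)
```

Reporting one score per task family, rather than a single pooled number, is what lets a benchmark of this kind separate strong image-level classification from weaker spatial reasoning.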