MedGEN-Bench: Contextually entangled benchmark for open-ended multimodal medical generation
MedGEN-Bench addresses the need for more realistic, contextually rich evaluation tools for medical AI systems, particularly as clinicians come to expect integrated text and image generation. The contribution is incremental, however, since it builds on existing benchmark efforts.
The paper tackles the limitations of existing medical visual benchmarks by introducing MedGEN-Bench, a comprehensive multimodal benchmark with 6,422 expert-validated image-text pairs across six imaging modalities and 16 clinical tasks, designed to evaluate open-ended generative outputs and cross-modal reasoning in medical AI.
As Vision-Language Models (VLMs) gain traction in medical applications, clinicians increasingly expect AI systems not only to generate textual diagnoses but also to produce corresponding medical images that integrate seamlessly into authentic clinical workflows. Despite this growing interest, existing medical visual benchmarks have notable limitations. They often rely on ambiguous queries that are only weakly related to the image content, reduce complex diagnostic reasoning to closed-ended shortcuts, and adopt a text-centric evaluation paradigm that overlooks image generation capabilities. To address these challenges, we introduce MedGEN-Bench, a comprehensive multimodal benchmark designed to advance medical AI research. MedGEN-Bench comprises 6,422 expert-validated image-text pairs spanning six imaging modalities, 16 clinical tasks, and 28 subtasks. It is structured into three distinct formats: Visual Question Answering, Image Editing, and Contextual Multimodal Generation. What sets MedGEN-Bench apart is its focus on contextually intertwined instructions that require sophisticated cross-modal reasoning and open-ended generative outputs, moving beyond the constraints of multiple-choice formats. To evaluate existing systems, we employ a novel three-tier assessment framework that integrates pixel-level metrics, semantic text analysis, and expert-guided clinical relevance scoring. Using this framework, we systematically assess 10 compositional frameworks, 3 unified models, and 5 VLMs.
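To make the "pixel-level metrics" tier of the assessment framework more concrete, the following is a minimal sketch of one widely used pixel-level measure, peak signal-to-noise ratio (PSNR). The abstract does not state which pixel-level metrics MedGEN-Bench uses, so the choice of PSNR, the function name `psnr`, and the 8-bit `max_value` default are all assumptions made purely for illustration.

```python
import math
from typing import Sequence

Image = Sequence[Sequence[float]]  # grayscale image as rows of pixel intensities


def psnr(reference: Image, generated: Image, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio (dB) between two equally sized grayscale images.

    Illustrative only: PSNR is an assumed example of a pixel-level metric,
    not necessarily one used by the benchmark.
    """
    if len(reference) != len(generated) or any(
        len(r) != len(g) for r, g in zip(reference, generated)
    ):
        raise ValueError("images must have the same shape")

    n_pixels = sum(len(row) for row in reference)
    if n_pixels == 0:
        raise ValueError("images must be non-empty")

    # Mean squared error over all pixels.
    mse = sum(
        (r - g) ** 2
        for ref_row, gen_row in zip(reference, generated)
        for r, g in zip(ref_row, gen_row)
    ) / n_pixels

    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A higher PSNR means the generated image is closer to the reference at the pixel level. It says nothing about clinical correctness, which is presumably why the framework pairs pixel-level metrics with semantic text analysis and expert-guided scoring.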