Med-R2: An Adversarial Benchmark for Evidence-Grounded Reasoning in Medical VLMs
For medical AI researchers, Med-R2 Bench provides a rigorous benchmark for assessing whether VLMs rely on spurious priors rather than evidence-grounded reasoning, a question that limited interpretability has so far left open.
The paper introduces Med-R2 Bench, a hierarchical benchmark with 42,432 images and 110,406 QA pairs to evaluate adversarial robustness and visual grounding in medical VLMs. Evaluation across 14 models reveals sequential performance degradation across clinical stages and heavy reliance on prompts, with stepwise fine-tuning significantly improving reasoning robustness.
Vision-language models (VLMs) have demonstrated impressive capabilities in general medical visual question answering. Because their interpretability is limited, however, it remains unclear whether their predictions reflect evidence-grounded clinical reasoning or reliance on spurious priors. We introduce Med-R2 Bench, a hierarchical benchmark aligned with the clinical workflow that evaluates adversarial robustness together with visual grounding. We design stepwise QA tasks to assess whether reasoning chains are strictly grounded in visual evidence across the four clinical stages, and we employ adversarial perturbations to test robustness against misleading cues. Med-R2 comprises 42,432 images, 31 task categories, and 110,406 QA pairs. Evaluation across 14 VLMs reveals a sequential performance degradation along the four-stage clinical workflow. Adversarial experiments show that models rely heavily on correct prompts to guess answers. Even when provided with explicit visual cues, the models struggle to align textual descriptions with them accurately. Finally, we demonstrate that stepwise fine-tuning on our hierarchical data significantly improves reasoning robustness, highlighting its potential to drive future improvements in evidence-based medical AI.
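As an illustration of the kind of measurement described above, the following is a minimal sketch of per-stage accuracy and the accuracy drop under adversarial prompts. It is not the benchmark's actual evaluation code: the record keys (`stage`, `prediction`, `answer`), the integer stage labels, the exact-match scoring, and the function names are all assumptions made for this example.

```python
from collections import defaultdict


def stage_accuracy(records):
    """Exact-match accuracy per stage.

    Each record is a dict with the hypothetical keys
    'stage', 'prediction', and 'answer'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["stage"]] += 1
        correct[r["stage"]] += int(r["prediction"] == r["answer"])
    return {stage: correct[stage] / total[stage] for stage in total}


def robustness_gap(clean_records, adversarial_records):
    """Per-stage accuracy drop from clean to adversarial prompts."""
    clean = stage_accuracy(clean_records)
    adversarial = stage_accuracy(adversarial_records)
    # Only compare stages present in both runs.
    return {s: clean[s] - adversarial[s] for s in clean if s in adversarial}


# Toy data (invented for illustration, not benchmark results).
clean_run = [
    {"stage": 1, "prediction": "A", "answer": "A"},
    {"stage": 1, "prediction": "B", "answer": "B"},
    {"stage": 2, "prediction": "A", "answer": "A"},
    {"stage": 2, "prediction": "C", "answer": "B"},
]
adversarial_run = [
    {"stage": 1, "prediction": "A", "answer": "A"},
    {"stage": 1, "prediction": "C", "answer": "B"},
    {"stage": 2, "prediction": "C", "answer": "A"},
    {"stage": 2, "prediction": "C", "answer": "B"},
]

print(stage_accuracy(clean_run))                  # {1: 1.0, 2: 0.5}
print(robustness_gap(clean_run, adversarial_run))  # {1: 0.5, 2: 0.5}
```

Under these assumptions, a sequential degradation would show up as decreasing values in `stage_accuracy` from the first stage to the fourth, and reliance on prompts as a large `robustness_gap` when the prompt is perturbed.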