CV · LG · Mar 19

ReXInTheWild: A Unified Benchmark for Medical Photograph Understanding

arXiv:2603.19517 · 56.6 · h-index: 38
AI Analysis

The paper addresses the problem of assessing AI models that interpret medical images in telemedicine and provides a clinically grounded benchmark. Its contribution is incremental, since it focuses on evaluation rather than new methods.

The paper tackles the lack of a comprehensive benchmark for evaluating vision-language models on medical photographs. It introduces ReXInTheWild, with 955 clinician-verified multiple-choice questions across 484 images. Among the models reported, Gemini-3 scores highest at 78% accuracy, while the medical specialist model MedGemma scores only 37%.

Everyday photographs taken with ordinary cameras are already widely used in telemedicine and other online health conversations, yet no comprehensive benchmark evaluates whether vision-language models can interpret their medical content. Analyzing these images requires both fine-grained natural image understanding and domain-specific medical reasoning, a combination that challenges both general-purpose and specialized models. We introduce ReXInTheWild, a benchmark of 955 clinician-verified multiple-choice questions spanning seven clinical topics across 484 photographs sourced from the biomedical literature. When evaluated on ReXInTheWild, leading multimodal large language models show substantial performance variation: Gemini-3 achieves 78% accuracy, followed by Claude Opus 4.5 (72%) and GPT-5 (68%), while the medical specialist model MedGemma reaches only 37%. A systematic error analysis also reveals four categories of common errors, ranging from low-level geometric errors to high-level reasoning failures and requiring different mitigation strategies. ReXInTheWild provides a challenging, clinically grounded benchmark at the intersection of natural image understanding and medical reasoning. The dataset is available on HuggingFace.

