CV · AI · May 21

Towards Clinically Interpretable Ophthalmic VQA via Spatially-Grounded Lesion Evidence

Xingyue Wang, Bo Liu, Meng Wang, Zhixuan Zhang, Chengcheng Zhu, Huazhu Fu, Jiang Liu

arXiv:2605.22414 · 60.6

Predicted impact: top 39% in CV · last 90 days · Originality: Synthesis-oriented

AI Analysis

For ophthalmologists and AI researchers, this work addresses the lack of interpretability in ophthalmic VQA by providing a benchmark with spatially-grounded lesion evidence, though it is an incremental step within the domain.

The paper introduces FundusGround, a benchmark for clinically interpretable ophthalmic VQA with spatially-grounded lesion evidence, collecting 10,719 fundus images and 72,706 questions. Experiments show that incorporating lesion-level visual evidence improves model performance and transparency.

Visual Question Answering (VQA) holds great promise for clinical support, particularly in ophthalmology, where retinal fundus photography is essential for diagnosis. However, existing ophthalmic VQA benchmarks primarily emphasize answer accuracy and neglect the explicit visual evidence necessary for clinical interpretability. In this work, we introduce FundusGround, a new benchmark for clinically interpretable ophthalmic VQA with spatially-grounded lesion evidence. Specifically, we propose a three-stage pipeline that collects 10,719 fundus images with 15,595 meticulously annotated image-level lesions. To ensure anatomical consistency and clinical validity, all lesions are spatially localized using the Early Treatment Diabetic Retinopathy Study (ETDRS) grid, enabling standardized mapping to nine clinically meaningful retinal regions. Built upon this structured lesion evidence, 72,706 questions are then generated in four formats: open-ended, closed-ended, single-choice, and multiple-choice. We further benchmark multiple general-purpose and medical large vision-language models using dual metrics for answer accuracy and lesion-level reasoning. The experiments demonstrate that incorporating lesion-level visual evidence consistently improves model performance and transparency, highlighting the necessity of explicit spatial grounding for reliable and explainable ophthalmic VQA.
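
To make the ETDRS mapping in the abstract concrete, here is a minimal Python sketch of how a lesion location could be assigned to one of the nine regions. It is not the authors' implementation. It uses the standard ETDRS layout: a central circle 1 mm in diameter, plus inner (3 mm) and outer (6 mm) rings, each split into superior, nasal, inferior, and temporal quadrants. The function name `etdrs_region`, the fovea coordinates, the `mm_per_px` scale, and the assumption that the nasal side of a right eye appears on the right of the image are all hypothetical choices made for illustration.

```python
import math

# ETDRS ring radii in millimetres (circle diameters of 1, 3 and 6 mm).
CENTER_R_MM, INNER_R_MM, OUTER_R_MM = 0.5, 1.5, 3.0


def etdrs_region(x, y, fovea_xy, mm_per_px, eye="right"):
    """Return the ETDRS region name for a pixel, or None if outside the grid.

    Assumptions (illustrative, not taken from the paper): image coordinates
    with x growing rightward and y growing downward, a known fovea centre,
    a known pixel-to-mm scale, and a non-mirrored fundus photograph in which
    the nasal side of a right eye lies on the right of the image.
    """
    dx = (x - fovea_xy[0]) * mm_per_px
    dy = (fovea_xy[1] - y) * mm_per_px  # flip so positive dy means superior
    r = math.hypot(dx, dy)
    if r <= CENTER_R_MM:
        return "central"
    if r > OUTER_R_MM:
        return None  # lesion lies outside the 6 mm grid
    ring = "inner" if r <= INNER_R_MM else "outer"

    # Quadrants are split along the diagonals: 0 deg = image right, 90 = up.
    angle = math.degrees(math.atan2(dy, dx)) % 360
    if 45 <= angle < 135:
        quadrant = "superior"
    elif 225 <= angle < 315:
        quadrant = "inferior"
    else:
        on_right = angle < 45 or angle >= 315
        nasal_on_right = eye == "right"
        quadrant = "nasal" if on_right == nasal_on_right else "temporal"
    return f"{ring}_{quadrant}"
```

For example, with the fovea at pixel (100, 100) and a scale of 0.01 mm per pixel, a lesion at (200, 100) in a right eye sits 1 mm to the right of the fovea and falls in the inner nasal region. In practice the fovea position and the pixel scale would have to be estimated per image, which this sketch does not attempt.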
