CVSep 30, 2025

LMOD+: A Comprehensive Multimodal Dataset and Benchmark for Developing and Evaluating Multimodal Large Language Models in Ophthalmology

Zhenyue Qin, Yang Liu, Yu Yin, Jinyu Ding, Haoran Zhang, Anran Li, Dylan Campbell, Xuansheng Wu, Ke Zou, Tiarnan D. L. Keenan, Emily Y. Chew, Zhiyong Lu

arXiv:2509.25620v12 citationsh-index: 54

Originality Synthesis-oriented

AI Analysis

This addresses the problem of limited datasets for developing and evaluating medical AI models in ophthalmology, which could help reduce the global burden of vision-threatening diseases, though it is incremental as an extension of their preliminary benchmark.

The authors tackled the lack of comprehensive benchmark datasets for evaluating multimodal large language models in ophthalmology by presenting LMOD+, a large-scale multimodal dataset with 32,633 instances across 12 conditions and 5 imaging modalities. The result includes evaluations showing top models achieved ~58% accuracy in disease screening under zero-shot settings, though performance remained suboptimal for tasks like disease staging.

Vision-threatening eye diseases pose a major global health burden, with timely diagnosis limited by workforce shortages and restricted access to specialized care. While multimodal large language models (MLLMs) show promise for medical image interpretation, advancing MLLMs for ophthalmology is hindered by the lack of comprehensive benchmark datasets suitable for evaluating generative models. We present a large-scale multimodal ophthalmology benchmark comprising 32,633 instances with multi-granular annotations across 12 common ophthalmic conditions and 5 imaging modalities. The dataset integrates imaging, anatomical structures, demographics, and free-text annotations, supporting anatomical structure recognition, disease screening, disease staging, and demographic prediction for bias evaluation. This work extends our preliminary LMOD benchmark with three major enhancements: (1) nearly 50% dataset expansion with substantial enlargement of color fundus photography; (2) broadened task coverage including binary disease diagnosis, multi-class diagnosis, severity classification with international grading standards, and demographic prediction; and (3) systematic evaluation of 24 state-of-the-art MLLMs. Our evaluations reveal both promise and limitations. Top-performing models achieved ~58% accuracy in disease screening under zero-shot settings, and performance remained suboptimal for challenging tasks like disease staging. We will publicly release the dataset, curation pipeline, and leaderboard to potentially advance ophthalmic AI applications and reduce the global burden of vision-threatening diseases.

View on arXiv PDF

Similar