CV · Jan 27, 2025

MM-Retinal V2: Transfer an Elite Knowledge Spark into Fundus Vision-Language Pretraining

arXiv:2501.15798v1 · 8 citations · h-index: 8 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses data scarcity and pretraining limitations in ophthalmic medical imaging. It is an incremental advance, since it builds on existing vision-language pretraining methods.

The authors address limited generalization in fundus image analysis with two contributions: MM-Retinal V2, a high-quality public image-text dataset covering CFP, FFA, and OCT modalities, and KeepFIT V2, a vision-language pretraining model that transfers knowledge from this elite dataset into categorical public datasets. The resulting model performs competitively with state-of-the-art models trained on large-scale private data.

Vision-language pretraining (VLP) has been investigated to generalize across diverse downstream tasks for fundus image analysis. Although recent methods showcase promising achievements, they rely heavily on large-scale private image-text data and pay less attention to the pretraining strategy, which limits further progress. In this work, we introduce MM-Retinal V2, a high-quality image-text paired dataset comprising CFP, FFA, and OCT image modalities. We then propose a novel fundus vision-language pretraining model, KeepFIT V2, which is pretrained by integrating knowledge from the elite data spark into categorical public datasets. Specifically, a preliminary textual pretraining is adopted to equip the text encoder with primarily ophthalmic textual knowledge. Moreover, a hybrid image-text knowledge injection module is designed for knowledge transfer, based on a combination of global semantic concepts from contrastive learning and local appearance details from generative learning. Extensive experiments across zero-shot, few-shot, and linear probing settings highlight the generalization and transferability of KeepFIT V2, which delivers performance competitive with state-of-the-art fundus VLP models trained on large-scale private image-text datasets. Our dataset and model are publicly available via https://github.com/lxirich/MM-Retinal.
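The abstract describes the hybrid knowledge injection module only at a high level: a contrastive objective for global semantics combined with a generative objective for local detail. The sketch below illustrates that general pattern, not the authors' implementation. It pairs a CLIP-style symmetric contrastive loss with a token-level cross-entropy (captioning-style) loss in plain Python. All function names, the temperature of 0.07, and the weighting factor gen_weight are assumptions introduced for illustration; the actual KeepFIT V2 formulation may differ and should be taken from the paper or repository.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss: the i-th image should match the i-th text.

    Hypothetical stand-in for the "global semantic" objective.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine similarity matrix scaled by the temperature.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    # Image-to-text direction: softmax over each row.
    i2t = -sum(log_softmax(sim[k])[k] for k in range(n)) / n
    # Text-to-image direction: softmax over each column.
    cols = [[sim[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(cols[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)


def generative_loss(token_logits, target_ids):
    """Mean cross-entropy of predicted report tokens against targets.

    Hypothetical stand-in for the "local appearance" generative objective.
    """
    losses = [-log_softmax(logits)[t]
              for logits, t in zip(token_logits, target_ids)]
    return sum(losses) / len(losses)


def hybrid_loss(image_embs, text_embs, token_logits, target_ids,
                gen_weight=1.0, temperature=0.07):
    """Weighted sum of the two objectives; gen_weight is an assumed knob."""
    return (contrastive_loss(image_embs, text_embs, temperature)
            + gen_weight * generative_loss(token_logits, target_ids))
```

With this sketch, correctly paired image and text embeddings yield a lower contrastive term than shuffled pairs, and uniform token logits yield a generative term equal to the log of the vocabulary size. A real training setup would compute both terms on encoder and decoder outputs inside an autodiff framework.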

Code Implementations: 1 repo