CV · Jan 7, 2025

MedicalNarratives: Connecting Medical Vision and Language with Localized Narratives

Wisdom O. Ikezogwo, Kevin Zhang, Mehmet Saygin Seyfioglu, Fatemeh Ghezloo, Linda Shapiro, Ranjay Krishna

arXiv:2501.04184v28.43 citations · h-index: 15

Originality: Incremental advance

AI Analysis

This work addresses the shortage of data for training medical AI models and enables more integrated pretraining. It is an incremental advance, however, because it builds on the existing CLIP and Localized Narratives approaches.

The authors tackled the lack of large-scale datasets for medical vision-language tasks by introducing MedicalNarratives, a dataset of 4.7M image-text pairs, 1M of which carry dense annotations. They showed that their GenMedClip model outperforms previous state-of-the-art models on a new medical imaging benchmark.

We propose MedicalNarratives, a dataset curated from medical pedagogical videos similar in nature to data collected in Think-Aloud studies and inspired by Localized Narratives, which collects grounded image-text data by curating instructors' speech and mouse cursor movements synchronized in time. MedicalNarratives enables pretraining of both semantic and dense objectives, alleviating the need to train medical semantic and dense tasks disparately due to the lack of reasonably sized datasets. Our dataset contains 4.7M image-text pairs from videos and articles, with 1M samples containing dense annotations in the form of traces and bounding boxes. To evaluate the utility of MedicalNarratives, we train GenMedClip based on the CLIP architecture using our dataset spanning 12 medical domains and demonstrate that it outperforms previous state-of-the-art models on a newly constructed medical imaging benchmark that comprehensively evaluates performance across all modalities. Data, demo, code and models available at https://medical-narratives.github.io
