IV CV LG | Mar 18, 2025

Advancing Medical Representation Learning Through High-Quality Data

Negin Baghbanzadeh, Adibvafa Fallahpour, Yasaman Parhizkar, Franklin Ogidi, Shuvendu Roy, Sajad Ashkezari, Vahid Reza Khazaie, Michael Colacci, Ali Etemad, Arash Afkanpour, Elham Dolatabadi

arXiv:2503.14377v1 | 5 citations | h-index: 7 | Has Code | MICCAI

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for better data curation in multimodal medical AI, though it is incremental because it contributes a higher-quality dataset rather than a new method.

The paper tackles the under-explored question of how dataset quality affects medical vision-language models. It introduces Open-PMC, a high-quality dataset of 2.2 million image-text pairs, and shows that dataset quality, not just size, drives significant performance gains in retrieval and zero-shot classification tasks.

Despite the growing scale of medical Vision-Language datasets, the impact of dataset quality on model performance remains under-explored. We introduce Open-PMC, a high-quality medical dataset from PubMed Central, containing 2.2 million image-text pairs, enriched with image modality annotations, subfigures, and summarized in-text references. Notably, the in-text references provide richer medical context, extending beyond the abstract information typically found in captions. Through extensive experiments, we benchmark Open-PMC against larger datasets across retrieval and zero-shot classification tasks. Our results show that dataset quality, not just size, drives significant performance gains. We complement our benchmark with an in-depth analysis of feature representation. Our findings highlight the crucial role of data curation quality in advancing multimodal medical AI. We release Open-PMC, along with the trained models and our codebase.
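
For readers unfamiliar with how image-text retrieval benchmarks of this kind are usually scored, here is a minimal sketch of Recall@K computed from paired embeddings. It is an illustration only: the function name `recall_at_k`, the use of cosine similarity, and the convention that row i of each array forms a matched pair are assumptions, not details taken from the paper or its codebase.

```python
import numpy as np

def recall_at_k(image_emb, text_emb, k=1):
    """Fraction of text queries whose paired image ranks in the top k.

    Assumption for this sketch: row i of image_emb and row i of text_emb
    come from the same image-text pair.
    """
    # L2-normalize so the dot product equals cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Similarity matrix with shape (num_texts, num_images).
    sims = text_emb @ image_emb.T

    # Indices of the k most similar images for each text query.
    top_k = np.argsort(-sims, axis=1)[:, :k]

    # A query counts as a hit if its own index appears among its top k.
    targets = np.arange(sims.shape[0])[:, None]
    return float((top_k == targets).any(axis=1).mean())
```

Calling `recall_at_k(image_emb, text_emb, k=5)` on embeddings produced by any contrastively trained encoder pair would give text-to-image Recall@5; swapping the two arguments gives the image-to-text direction.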
