IV | CV | LG | Mar 18, 2025

Advancing Medical Representation Learning Through High-Quality Data

arXiv:2503.14377v1 | 5 citations | h-index: 7 | Has Code | MICCAI
Originality: Synthesis-oriented
AI Analysis

This paper addresses the need for better data curation in multimodal medical AI. Its contribution is incremental: it focuses on dataset quality rather than proposing a new method.

The paper tackles the problem of dataset quality in medical vision-language models by introducing Open-PMC, a high-quality dataset of 2.2 million image-text pairs, and shows that dataset quality drives significant performance gains in retrieval and zero-shot classification tasks.

Despite the growing scale of medical Vision-Language datasets, the impact of dataset quality on model performance remains under-explored. We introduce Open-PMC, a high-quality medical dataset from PubMed Central, containing 2.2 million image-text pairs, enriched with image modality annotations, subfigures, and summarized in-text references. Notably, the in-text references provide richer medical context, extending beyond the abstract information typically found in captions. Through extensive experiments, we benchmark Open-PMC against larger datasets across retrieval and zero-shot classification tasks. Our results show that dataset quality, not just size, drives significant performance gains. We complement our benchmark with an in-depth analysis of feature representation. Our findings highlight the crucial role of data curation quality in advancing multimodal medical AI. We release Open-PMC, along with the trained models and our codebase.

Code Implementations: 1 repo