CV, LG | Jun 11, 2024

Benchmarking Vision-Language Contrastive Methods for Medical Representation Learning

arXiv:2406.07450v1 | 1 citation
Originality: Synthesis-oriented
AI Analysis

This work provides a comprehensive benchmark for medical representation learning and addresses key questions for researchers in medical AI. Its contribution is incremental, however, because it focuses on evaluating existing methods.

The paper benchmarked eight contrastive learning methods on medical vision-language tasks using 2.8 million image-text pairs, finding that general-domain representations transfer well to medical tasks, multimodal training alone is sufficient without unimodal training, and fine-grained features improve performance.

We perform a comprehensive benchmarking of contrastive frameworks for learning multimodal representations in the medical domain. Through this study, we aim to answer the following research questions: (i) How transferable are general-domain representations to the medical domain? (ii) Is multimodal contrastive training sufficient, or does it benefit from unimodal training as well? (iii) What is the impact of feature granularity on the effectiveness of multimodal medical representation learning? To answer these questions, we investigate eight contrastive learning approaches under identical training setups, and train them on 2.8 million image-text pairs from four datasets, and evaluate them on 25 downstream tasks, including classification (zero-shot and linear probing), image-to-text and text-to-image retrieval, and visual question-answering. Our findings suggest a positive answer to the first question, a negative answer to the second question, and the benefit of learning fine-grained features. Finally, we make our code publicly available.
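The abstract does not name the eight methods, so the following is only a minimal sketch of a generic CLIP-style symmetric contrastive objective and of zero-shot classification by image-text similarity. It is not the authors' implementation. The function names, the toy embeddings, and the temperature of 0.07 are all assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy.

    Pair k (image k, text k) is treated as the positive; every other
    pairing in the batch is a negative. The temperature is an assumed value.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    image_to_text = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    text_to_image = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)


def zero_shot_classify(image_emb, class_text_embs):
    """Return the index of the class prompt most similar to the image."""
    img = l2_normalize(image_emb)
    sims = [
        sum(a * b for a, b in zip(img, l2_normalize(t)))
        for t in class_text_embs
    ]
    return max(range(len(sims)), key=sims.__getitem__)


# Toy 2-D embeddings (hypothetical): aligned pairs versus swapped captions.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = symmetric_contrastive_loss(images, aligned_texts)
loss_swapped = symmetric_contrastive_loss(images, swapped_texts)
predicted_class = zero_shot_classify([0.9, 0.1], aligned_texts)
```

In this toy setup the aligned batch yields a much lower loss than the swapped one, which is the signal such an objective optimizes. Zero-shot classification reuses the same similarity scores, with one text embedding per class prompt.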

Code Implementations: 1 repo