CV | Mar 30, 2023

Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime

arXiv:2303.17644v1 | 16 citations | h-index: 188
Originality: Incremental advance
AI Analysis

This work addresses the scarcity of training data for medical AI applications, a common bottleneck in clinical settings.

This paper tackles the problem of training medical vision-language models with limited clinical data by exploring methods like domain adaptation, contrastive losses, and extra supervision. The combined approach significantly improves text-to-image retrieval performance compared to fine-tuning CLIP and outperforms CLIP and BioVIL in chest X-ray condition classification tasks.

This paper explores training medical vision-language models (VLMs) -- where the visual and language inputs are embedded into a common space -- with a particular focus on scenarios where training data is limited, as is often the case in clinical datasets. We explore several candidate methods to improve low-data performance, including: (i) adapting generic pre-trained models to novel image and text domains (i.e. medical imaging and reports) via unimodal self-supervision; (ii) using local (e.g. GLoRIA) & global (e.g. InfoNCE) contrastive loss functions as well as a combination of the two; (iii) extra supervision during VLM training, via: (a) image- and text-only self-supervision, and (b) creating additional positive image-text pairs for training through augmentation and nearest-neighbour search. Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports. Combined, they significantly improve retrieval compared to fine-tuning CLIP, roughly equivalent to training with 10x the data. A similar pattern is found in the downstream task classification of CXR-related conditions with our method outperforming CLIP and also BioVIL, a strong CXR VLM benchmark, in the zero-shot and linear probing settings. We conclude with a set of recommendations for researchers aiming to train vision-language models on other medical imaging modalities when training data is scarce. To facilitate further research, we will make our code and models publicly available.
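As a rough illustration of two ideas named in the abstract, the sketch below computes a global, InfoNCE-style symmetric contrastive loss over paired image and text embeddings and a Recall@K score for text-to-image retrieval. It is not the authors' implementation: the function names, the toy embeddings, and the temperature of 0.07 are all assumptions chosen for illustration, and real models would use learned encoders and a tensor library rather than plain Python lists.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(image_embs, text_embs):
    """Cosine similarity: rows index images, columns index texts."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]


def _cross_entropy_on_diagonal(logits):
    """Mean of -log softmax(row)[k] for row k, i.e. the matched pair is the target."""
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[k]
    return total / len(logits)


def symmetric_info_nce(image_embs, text_embs, temperature=0.07):
    """Average of the image-to-text and text-to-image InfoNCE terms.

    Assumes image_embs[i] and text_embs[i] form a positive pair and all other
    combinations in the batch act as negatives. The temperature is an assumed value.
    """
    sim = similarity_matrix(image_embs, text_embs)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]  # transpose: texts as rows
    return 0.5 * (_cross_entropy_on_diagonal(logits)
                  + _cross_entropy_on_diagonal(logits_t))


def text_to_image_recall_at_k(image_embs, text_embs, k=1):
    """Fraction of text queries whose paired image ranks in the top k."""
    sim = similarity_matrix(image_embs, text_embs)
    n_images, n_texts = len(image_embs), len(text_embs)
    hits = 0
    for j in range(n_texts):
        scores = [sim[i][j] for i in range(n_images)]
        ranked = sorted(range(n_images), key=lambda i: scores[i], reverse=True)
        hits += j in ranked[:k]
    return hits / n_texts


if __name__ == "__main__":
    # Hypothetical 2-D embeddings standing in for encoder outputs.
    images = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    reports = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]
    print("loss:", symmetric_info_nce(images, reports))
    print("recall@1:", text_to_image_recall_at_k(images, reports, k=1))
```

With well-aligned toy pairs the loss is low and every report retrieves its own image first; swapping two reports raises the loss and lowers Recall@1, which is the behaviour a retrieval benchmark of this kind is meant to expose.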

