More performant and scalable: Rethinking contrastive vision-language pre-training of radiology in the LLM era
This work addresses the problem of costly and limited supervised pre-training for medical AI, particularly in radiology, by using LLMs to democratize access to large-scale datasets, though it is incremental in applying existing methods to a new domain.
The paper tackled the challenge of improving contrastive vision-language pre-training for radiology by leveraging LLMs to automatically extract diagnostic labels from reports with high precision (>96% AUC), enabling low-cost creation of large-scale datasets (~$3 for 50k pairs). The result was state-of-the-art performance, including 83.8% AUC for zero-shot diagnosis on CT-RATE and 77.3% AUC on RAD-ChestCT, demonstrating more performant and scalable medical AI systems.
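For readers unfamiliar with how zero-shot diagnosis AUC is typically obtained from a CLIP-style model, the following is a minimal sketch under stated assumptions, not the paper's implementation. It assumes each finding is scored by comparing an image embedding against a "positive" and a "negative" text prompt embedding, and that AUC is computed per finding from those scores. The function names, the prompt pairing, and the temperature of 0.07 are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def zero_shot_scores(image_emb, pos_text_emb, neg_text_emb, temperature=0.07):
    """Score each image for each finding (assumed prompt-pair scheme).

    image_emb:     (N, D) image embeddings
    pos_text_emb:  (C, D) embeddings of "finding present" prompts
    neg_text_emb:  (C, D) embeddings of "finding absent" prompts
    Returns an (N, C) array of probabilities in [0, 1].
    """
    img = l2_normalize(image_emb)
    pos_logit = img @ l2_normalize(pos_text_emb).T / temperature
    neg_logit = img @ l2_normalize(neg_text_emb).T / temperature
    # A two-way softmax over (pos, neg) equals a sigmoid of the logit difference.
    return 1.0 / (1.0 + np.exp(-(pos_logit - neg_logit)))


def binary_auc(scores, labels):
    """Rank-based AUC (Mann-Whitney U statistic); tied scores share an average rank."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    n_pos, n_neg = int(labels.sum()), int((~labels).sum())
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative label")
    order = np.argsort(scores, kind="mergesort")
    sorted_scores = scores[order]
    ranks = np.empty(len(scores))
    i = 0
    while i < len(scores):
        j = i
        while j + 1 < len(scores) and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0  # average 1-based rank
        i = j + 1
    u = ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)
```

In such a setup, the reported number would usually be the mean of `binary_auc` over all findings, with one column of `zero_shot_scores` per finding.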
The emergence of Large Language Models (LLMs) presents unprecedented opportunities to revolutionize medical contrastive vision-language pre-training. In this paper, we show how LLMs can facilitate large-scale supervised pre-training, thereby advancing vision-language alignment. We begin by demonstrating that modern LLMs can automatically extract diagnostic labels from radiology reports with remarkable precision (>96% AUC in our experiments) without complex prompt engineering, enabling the creation of large-scale "silver-standard" datasets at minimal cost (~$3 for 50k CT image-report pairs). Further, we find that a vision encoder trained on this "silver-standard" dataset achieves performance comparable to that of encoders trained on labels extracted by specialized BERT-based models, thereby democratizing access to large-scale supervised pre-training. Building on this foundation, we reveal that supervised pre-training fundamentally improves contrastive vision-language alignment. Our approach achieves state-of-the-art performance using only a 3D ResNet-18 with vanilla CLIP training, including 83.8% AUC for zero-shot diagnosis on CT-RATE, 77.3% AUC on RAD-ChestCT, and substantial improvements in cross-modal retrieval (MAP@50=53.7% for image-image, Recall@100=52.2% for report-image). These results demonstrate the potential of utilizing LLMs to facilitate more performant and scalable medical AI systems. Our code is available at https://github.com/SadVoxel/More-performant-and-scalable.
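The Recall@K retrieval metric mentioned in the abstract can be illustrated with a short sketch. This is a common definition rather than the paper's evaluation code: it assumes report `i` and image `i` form the only correct pair, ranks images by cosine similarity to each report embedding, and counts a hit when the paired image appears in the top K. The function name `recall_at_k` and the one-to-one pairing are assumptions made for illustration.

```python
import numpy as np


def recall_at_k(query_emb, gallery_emb, k):
    """Fraction of queries whose paired gallery item ranks in the top k.

    query_emb:   (N, D) query embeddings (e.g. reports)
    gallery_emb: (N, D) gallery embeddings (e.g. images); row i pairs with query i
    """
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sim = q @ g.T  # (N, N) cosine similarities
    k = min(k, sim.shape[1])
    # argpartition places the k most similar gallery indices in the first k columns
    # (unordered among themselves, which is enough for a membership check).
    topk = np.argpartition(-sim, k - 1, axis=1)[:, :k]
    targets = np.arange(sim.shape[0])[:, None]
    return float((topk == targets).any(axis=1).mean())
```

For example, `recall_at_k(report_emb, image_emb, k=100)` would give a report-to-image Recall@100 under this pairing assumption, where `report_emb` and `image_emb` are hypothetical arrays produced by the text and vision encoders.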