CV · Jul 28, 2025

Fairness and Robustness of CLIP-Based Models for Chest X-rays

Théo Sourget, David Restrepo, Céline Hudelot, Enzo Ferrante, Stergios Christodoulidis, Maria Vakalopoulou

arXiv:2507.21291v1 · h-index: 22 · Has Code · FAIMI@MICCAI

Originality: Synthesis-oriented

AI Analysis

This work addresses the fairness and robustness of medical AI models for radiology, an issue that is crucial for equitable healthcare. Its contribution is incremental, however, because it evaluates existing models rather than proposing new solutions.

The study evaluated the fairness and robustness of six CLIP-based models on chest X-ray classification across three datasets, finding performance gaps by age and reliance on spurious correlations like chest drains, with more equitable results for sex and race.

Motivated by the strong performance of CLIP-based models in natural image-text domains, recent efforts have adapted these architectures to medical tasks, particularly in radiology, where large paired datasets of images and reports, such as chest X-rays, are available. While these models have shown encouraging results in terms of accuracy and discriminative performance, their fairness and robustness across different clinical tasks remain largely underexplored. In this study, we extensively evaluate six widely used CLIP-based models on chest X-ray classification using three publicly available datasets: MIMIC-CXR, NIH-CXR14, and NEATX. We assess the models' fairness across six conditions and across patient subgroups based on age, sex, and race. Additionally, we assess robustness to shortcut learning by evaluating performance on pneumothorax cases with and without chest drains. Our results indicate performance gaps between patients of different ages, but more equitable results for the other attributes. Moreover, all models exhibit lower performance on images without chest drains, suggesting reliance on spurious correlations. We further complement the performance analysis with a study of the embeddings generated by the models. While the sensitive attributes could be classified from the embeddings, we do not see such patterns using PCA, showing the limitations of these visualisation techniques when assessing models. Our code is available at https://github.com/TheoSourget/clip_cxr_fairness
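As a rough illustration of the subgroup comparison described in the abstract, the sketch below computes a per-subgroup AUC and the largest gap between subgroups. It is not the authors' code: the function names `auc` and `subgroup_auc_gap`, the choice of AUC as the metric, and the toy data are all assumptions made for this example.

```python
from collections import defaultdict


def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined when a class is missing
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def subgroup_auc_gap(labels, scores, groups):
    """Return per-subgroup AUCs and the max-min gap (hypothetical helper)."""
    by_group = defaultdict(lambda: ([], []))
    for y, s, g in zip(labels, scores, groups):
        by_group[g][0].append(y)
        by_group[g][1].append(s)
    aucs = {g: auc(ys, ss) for g, (ys, ss) in by_group.items()}
    defined = [a for a in aucs.values() if a is not None]
    gap = max(defined) - min(defined) if defined else None
    return aucs, gap


# Toy data (invented for illustration, not taken from the paper).
labels = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.1, 0.8, 0.3, 0.6, 0.7, 0.4, 0.2]
groups = ["young"] * 4 + ["old"] * 4

aucs, gap = subgroup_auc_gap(labels, scores, groups)
print(aucs, gap)
```

The same pattern would apply to the chest-drain analysis by replacing the age groups with a hypothetical "drain" / "no drain" attribute, under the assumption that both subgroups contain positive and negative cases.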
