Explaining medical AI performance disparities across sites with confounder Shapley value analysis
This work addresses the problem of ensuring equitable and effective AI deployment in healthcare by explaining cross-site performance disparities, though the contribution is incremental because it builds on existing multi-site evaluation methods.
The paper tackles medical AI performance disparities across sites by developing a framework that quantifies the impact of biases such as patient demographics and imaging parameters. In a pneumothorax detection case study, the framework explains up to 60% of the performance discrepancy.
Medical AI algorithms can often experience degraded performance when evaluated on previously unseen sites. Addressing cross-site performance disparities is key to ensuring that AI is equitable and effective when deployed on diverse patient populations. Multi-site evaluations are key to diagnosing such disparities as they can test algorithms across a broader range of potential biases such as patient demographics, equipment types, and technical parameters. However, such tests do not explain why the model performs worse. Our framework provides a method for quantifying the marginal and cumulative effect of each type of bias on the overall performance difference when a model is evaluated on external data. We demonstrate its usefulness in a case study of a deep learning model trained to detect the presence of pneumothorax, where our framework can help explain up to 60% of the discrepancy in performance across different sites with known biases like disease comorbidities and imaging parameters.
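The idea of splitting a cross-site performance gap into per-bias contributions can be illustrated with a generic exact Shapley value computation. This is a minimal sketch under stated assumptions, not the paper's implementation: the confounder names (`age`, `comorbidity`, `view_position`), the value function `toy_explained_gap`, and every number in it are hypothetical and chosen only to show how the attribution works.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values for a set function `value` over `players`.

    `value` takes a frozenset of players and returns a number.
    Cost is exponential in len(players), so this suits only a handful of confounders.
    """
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            # Weight of a coalition of size k that does not contain p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal effect of adding p to coalition s.
                phi[p] += weight * (value(s | {p}) - value(s))
    return phi


def toy_explained_gap(subset):
    """Hypothetical value function (assumption, not from the paper).

    Returns the portion of the cross-site performance gap that would be
    explained after adjusting for the confounders in `subset`.
    """
    solo = {"age": 0.02, "comorbidity": 0.03, "view_position": 0.01}
    total = sum(solo[f] for f in subset)
    # Assumed overlap: age and comorbidity partly explain the same gap.
    if {"age", "comorbidity"} <= subset:
        total -= 0.01
    return total


confounders = ["age", "comorbidity", "view_position"]
phi = shapley_values(confounders, toy_explained_gap)
for name in confounders:
    print(f"{name}: {phi[name]:.3f}")
```

The per-confounder values are the marginal effects, and by the efficiency property of Shapley values they sum to the cumulative effect of adjusting for all confounders at once (`toy_explained_gap` evaluated on the full set). In the toy game, the assumed overlap between `age` and `comorbidity` is shared equally between the two.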