Detecting Dataset Bias in Medical AI: A Generalized and Modality-Agnostic Auditing Framework
This addresses the issue of AI bias in healthcare deployment for developers and practitioners, offering a practical tool to improve reliability, though it is incremental as it builds on existing bias detection concepts.
The paper tackles the problem of dataset bias in medical AI by introducing G-AUDIT, a modality-agnostic auditing framework that identifies subtle biases in training and testing data across three medical modalities, successfully uncovering overlooked risks in skin lesion classification, EHR language classification, and ICU mortality prediction.
Artificial Intelligence (AI) is now firmly at the center of evidence-based medicine. Despite many success stories that edge the path of AI's rise in healthcare, there are comparably many reports of significant shortcomings and unexpected behavior of AI in deployment. A major reason for these limitations is AI's reliance on association-based learning, where non-representative machine learning datasets can amplify latent bias during training and/or hide it during testing. To unlock new tools capable of foreseeing and preventing such AI bias issues, we present G-AUDIT. Generalized Attribute Utility and Detectability-Induced bias Testing (G-AUDIT) for datasets is a modality-agnostic dataset auditing framework that allows for generating targeted hypotheses about sources of bias in training or testing data. Our method examines the relationship between task-level annotations (commonly referred to as ``labels'') and data properties including patient attributes (e.g., age, sex) and environment/acquisition characteristics (e.g., clinical site, imaging protocols). G-AUDIT quantifies the extent to which the observed data attributes pose a risk for shortcut learning, or in the case of testing data, might hide predictions made based on spurious associations. We demonstrate the broad applicability of our method by analyzing large-scale medical datasets for three distinct modalities and machine learning tasks: skin lesion classification in images, stigmatizing language classification in Electronic Health Records (EHR), and mortality prediction for ICU tabular data. In each setting, G-AUDIT successfully identifies subtle biases commonly overlooked by traditional qualitative methods, underscoring its practical value in exposing dataset-level risks and supporting the downstream development of reliable AI systems.