Model Selection's Disparate Impact in Real-World Deep Learning Applications
This paper addresses fairness issues in real-world ML pipelines for medical applications, but the contribution is incremental: it focuses on one under-explored source of bias.
The paper investigates how human preferences in model selection, particularly the choice of comparison metrics that ignore variability, can cause disparate impact across demographic groups in deep learning applications, demonstrating this with a medical imaging model.
Algorithmic fairness has emphasized the role of biased data in automated decision outcomes. Recently, attention has shifted to sources of bias that implicate fairness at other stages of the ML pipeline. We contend that one such source, human preferences in model selection, remains under-explored in terms of its role in disparate impact across demographic groups. Using a deep learning model trained on real-world medical imaging data, we verify our claim empirically and argue that the choice of metrics for model comparison, especially metrics that do not take variability into account, can significantly bias model selection outcomes.
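The point about variability can be made concrete with a small sketch. The per-seed accuracies below are hypothetical numbers chosen for illustration, not results from the paper, and the "mean minus k standard deviations" criterion is only one assumed way of accounting for variability, not necessarily the comparison the authors use. The names `runs`, `mean_score`, `lower_bound_score`, `max_group_gap`, and `select` are likewise illustrative.

```python
from statistics import mean, stdev

# Hypothetical per-seed accuracies for two models on two demographic groups.
# These numbers are invented for illustration and do not come from the paper.
runs = {
    "model_a": {"group_1": [0.93, 0.82, 0.95, 0.80, 0.94],
                "group_2": [0.91, 0.72, 0.93, 0.70, 0.92]},
    "model_b": {"group_1": [0.86, 0.87, 0.85, 0.86, 0.87],
                "group_2": [0.84, 0.85, 0.84, 0.85, 0.84]},
}

def overall(per_group):
    """Per-seed accuracy averaged over groups (assumes equal group sizes)."""
    return [mean(seed_vals) for seed_vals in zip(*per_group.values())]

def mean_score(per_group):
    """Comparison metric that ignores variability across seeds."""
    return mean(overall(per_group))

def lower_bound_score(per_group, k=1.0):
    """Assumed variability-aware metric: mean minus k sample standard deviations."""
    scores = overall(per_group)
    return mean(scores) - k * stdev(scores)

def max_group_gap(per_group):
    """Largest between-group accuracy gap observed in any single seed."""
    return max(max(seed_vals) - min(seed_vals)
               for seed_vals in zip(*per_group.values()))

def select(criterion):
    """Return the name of the model that maximizes the given criterion."""
    return max(runs, key=lambda name: criterion(runs[name]))

if __name__ == "__main__":
    print("selected by mean:        ", select(mean_score))
    print("selected by lower bound: ", select(lower_bound_score))
    for name, per_group in runs.items():
        print(name, "max group gap:", round(max_group_gap(per_group), 3))
```

With these invented numbers, the two criteria select different models, and the model favored by the plain mean also shows the larger between-group gap in its worst seeds. This is the kind of selection effect the abstract describes, shown here only as a toy case.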