When Are Learning Biases Equivalent? A Unifying Framework for Fairness, Robustness, and Distribution Shift
This work unifies the literature on fairness, robustness, and distribution shift under a common framework, addressing a foundational problem for the ML/AI community.
The paper addresses diverse failure modes of machine learning systems (unfairness, brittleness, and poor performance on minority groups) by proposing a unifying theoretical framework that characterizes when different bias mechanisms produce quantitatively equivalent effects on model performance. Empirical validation shows that the predicted equivalences hold to within 3% in worst-group accuracy.
Machine learning systems exhibit diverse failure modes, including unfairness toward protected groups, brittleness to spurious correlations, and poor performance on minority sub-populations. These failures are typically studied in isolation by distinct research communities. We propose a unifying theoretical framework that characterizes when different bias mechanisms produce quantitatively equivalent effects on model performance. By formalizing biases as violations of conditional independence, quantified with information-theoretic measures, we prove formal equivalence conditions relating spurious correlations, sub-population shift, class imbalance, and fairness violations. Our theory predicts that a spurious correlation of strength $\alpha$ produces the same worst-group accuracy degradation as a sub-population imbalance ratio $r \approx (1+\alpha)/(1-\alpha)$ under feature overlap assumptions. Empirical validation on six datasets and three architectures confirms that the predicted equivalences hold to within 3\% in worst-group accuracy, enabling the principled transfer of debiasing methods across problem domains. This work bridges the literature on fairness, robustness, and distribution shift under a common perspective.
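As a rough numerical illustration of the stated relation $r \approx (1+\alpha)/(1-\alpha)$, the sketch below converts a spurious-correlation strength into the corresponding imbalance ratio and back. The function names are hypothetical, and the domain $0 \le \alpha < 1$ is an assumption made here for the formula to be well defined; the abstract does not specify how $\alpha$ is normalized.

```python
def equivalent_imbalance_ratio(alpha: float) -> float:
    """Imbalance ratio r predicted to match a spurious correlation of strength alpha.

    Implements r = (1 + alpha) / (1 - alpha) from the abstract.
    Assumption (not stated in the source): 0 <= alpha < 1.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha is assumed to lie in [0, 1)")
    return (1.0 + alpha) / (1.0 - alpha)


def equivalent_correlation_strength(r: float) -> float:
    """Algebraic inverse of the relation above: alpha = (r - 1) / (r + 1).

    Assumption (not stated in the source): r >= 1, with r = 1 meaning balanced groups.
    """
    if r < 1.0:
        raise ValueError("r is assumed to be at least 1")
    return (r - 1.0) / (r + 1.0)


if __name__ == "__main__":
    # Print the predicted imbalance ratio for a few correlation strengths.
    for alpha in (0.0, 0.5, 0.8, 0.9):
        r = equivalent_imbalance_ratio(alpha)
        print(f"alpha = {alpha:.2f}  ->  r = {r:.2f}")
```

Under this relation, $\alpha = 0.5$ corresponds to $r = 3$, and the ratio grows without bound as $\alpha$ approaches 1, so a small increase in an already strong spurious correlation maps to a large change in the equivalent imbalance.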