Feb 13, 2023

Provable Detection of Propagating Sampling Bias in Prediction Models

arXiv:2302.06752v1 · 7 citations · h-index: 33
Originality Highly original
AI Analysis

This paper addresses the need for provable fairness detection in machine learning pipelines, particularly in high-stakes domains like criminal justice, though it is incremental in that it builds on prior qualitative work.

The paper tackles the problem of how differential sampling bias in training data propagates to prediction models. It quantifies this propagation theoretically and proves conditions under which the bias becomes detectable by auditors, with validation on two criminal justice datasets: COMPAS and NYPD stop and frisk data.

With an increased focus on incorporating fairness in machine learning models, it becomes imperative not only to assess and mitigate bias at each stage of the machine learning pipeline but also to understand the downstream impacts of bias across stages. Here we consider a general, but realistic, scenario in which a predictive model is learned from (potentially biased) training data, and model predictions are assessed post-hoc for fairness by some auditing method. We provide a theoretical analysis of how a specific form of data bias, differential sampling bias, propagates from the data stage to the prediction stage. Unlike prior work, we evaluate the downstream impacts of data biases quantitatively rather than qualitatively and prove theoretical guarantees for detection. Under reasonable assumptions, we quantify how the amount of bias in the model predictions varies as a function of the amount of differential sampling bias in the data, and at what point this bias becomes provably detectable by the auditor. Through experiments on two criminal justice datasets -- the well-known COMPAS dataset and historical data from NYPD's stop and frisk policy -- we demonstrate that the theoretical results hold in practice even when our assumptions are relaxed.
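To make the idea of bias propagating from the data stage to the prediction stage concrete, here is a minimal sketch. It is not the paper's formalization: it assumes a toy setting in which positive examples from one group are sampled `oversample` times as often as negatives, and in which the "model" simply predicts the positive rate it observed in training. The function names and the `oversample` parameter are hypothetical, introduced only for illustration.

```python
def observed_positive_rate(true_rate: float, oversample: float) -> float:
    """Positive rate seen in training data when positives are sampled
    `oversample` times as often as negatives (assumed toy bias model)."""
    pos = oversample * true_rate   # relative mass of sampled positives
    neg = 1.0 - true_rate          # relative mass of sampled negatives
    return pos / (pos + neg)


def prediction_gap(true_rate: float, oversample: float) -> float:
    """Gap between what a rate-matching model would predict for the
    over-sampled group and the group's true positive rate."""
    return observed_positive_rate(true_rate, oversample) - true_rate


if __name__ == "__main__":
    true_rate = 0.3  # assumed true positive rate, identical across groups
    for oversample in (1.0, 1.5, 2.0, 3.0):
        gap = prediction_gap(true_rate, oversample)
        print(f"oversample={oversample:.1f}  gap={gap:.4f}")
```

In this toy setting the gap is zero when `oversample` is 1 and grows as the sampling bias increases, which mirrors the qualitative relationship the abstract describes. The point at which an auditor can provably detect the gap depends on the paper's assumptions and is not modeled here.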
