LG ME Jul 3, 2023

Systematic Bias in Sample Inference and its Effect on Machine Learning

arXiv:2307.01384v1 1 citation h-index: 18

Originality: Incremental advance

AI Analysis

This paper addresses fairness and bias issues in ML for minority groups, but the advance is incremental because it builds on known statistical principles.

The paper attributes systematic underprediction in machine learning models, particularly for minority groups, to statistical bias from small-sample inference. On more than 70 subsets of the 'adult' and COMPAS datasets, a bias prediction measure based on small-sample inference correlated positively (0.56 and 0.85) with the observed underprediction rate.

A commonly observed pattern in machine learning models is an underprediction of the target feature: the model's predicted target rate for members of a given category is typically lower than the actual target rate for members of that category in the training set. This underprediction is usually larger for members of minority groups. In the 'adult' dataset, for example, income level is underpredicted for both men and women, but the degree of underprediction is significantly higher for women (a minority in that dataset). We propose that this pattern of underprediction for minorities arises as a predictable consequence of statistical inference on small samples. When presented with a new individual for classification, an ML model performs inference not on the entire training set but on a subset that is in some way similar to the new individual. The sizes of these subsets typically follow a power law distribution, so most are small, and they are necessarily smaller for the minority group. We show that such inference on small samples is subject to systematic and directional statistical bias, and that this bias produces the patterns of underprediction observed in ML models. Analysing a standard sklearn decision tree model's predictions on a set of over 70 subsets of the 'adult' and COMPAS datasets, we found that a bias prediction measure based on small-sample inference had significant positive correlations (0.56 and 0.85) with the observed underprediction rate for these subsets.
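The underprediction described in the abstract can be illustrated with a minimal sketch that compares the actual and predicted target rates per category. The function name `underprediction_by_group`, the use of a simple difference (actual rate minus predicted rate) as the underprediction measure, and the toy data are all assumptions made for illustration; they are not the paper's bias prediction measure and are not taken from the 'adult' or COMPAS datasets.

```python
from collections import defaultdict


def underprediction_by_group(groups, y_true, y_pred):
    """Return {group: (actual_rate, predicted_rate, underprediction)}.

    Hypothetical helper: underprediction is taken here to be
    actual_rate - predicted_rate, so a positive value means the model
    predicts the target less often than it actually occurs.
    """
    # Per group: [number of members, actual positives, predicted positives]
    counts = defaultdict(lambda: [0, 0, 0])
    for group, actual, predicted in zip(groups, y_true, y_pred):
        entry = counts[group]
        entry[0] += 1
        entry[1] += actual
        entry[2] += predicted

    result = {}
    for group, (n, actual_pos, predicted_pos) in counts.items():
        actual_rate = actual_pos / n
        predicted_rate = predicted_pos / n
        result[group] = (actual_rate, predicted_rate, actual_rate - predicted_rate)
    return result


# Toy, made-up labels and predictions (1 = target present, 0 = absent).
groups = ["men"] * 8 + ["women"] * 4
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

rates = underprediction_by_group(groups, y_true, y_pred)
for group, (actual, predicted, gap) in rates.items():
    print(f"{group}: actual={actual:.3f} predicted={predicted:.3f} "
          f"underprediction={gap:.3f}")
```

In this toy data the smaller group shows the larger gap, which mirrors the pattern the abstract reports. In practice `y_pred` would come from a fitted classifier such as the sklearn decision tree mentioned above.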
