LG ML Feb 15, 2019

The Fairness of Risk Scores Beyond Classification: Bipartite Ranking and the xAUC Metric

arXiv:1902.05826v218.284 citations Has Code

Originality: Incremental advance

AI Analysis

This work addresses fairness concerns for stakeholders in criminal justice, finance, and healthcare by providing a metric for evaluating risk scores beyond binary classification. The advance is incremental, since it builds on existing characterizations of fairness.

The paper examines the fairness of machine-learned predictive risk scores used for high-stakes decisions. It introduces the xAUC metric to assess disparate impact in bipartite ranking tasks and applies it to audit risk scores for recidivism, income, and cardiac arrest prediction, revealing disparities that are not evident from within-group comparisons alone.

Where machine-learned predictive risk scores inform high-stakes decisions, such as bail and sentencing in criminal justice, fairness has been a serious concern. Recent work has characterized the disparate impact that such risk scores can have when used for a binary classification task. This may not account, however, for the more diverse downstream uses of risk scores and their non-binary nature. To better account for this, in this paper, we investigate the fairness of predictive risk scores from the point of view of a bipartite ranking task, where one seeks to rank positive examples higher than negative ones. We introduce the xAUC disparity as a metric to assess the disparate impact of risk scores and define it as the difference in the probabilities of ranking a random positive example from one protected group above a negative one from another group and vice versa. We provide a decomposition of bipartite ranking loss into components that involve the discrepancy and components that involve pure predictive ability within each group. We use xAUC analysis to audit predictive risk scores for recidivism prediction, income prediction, and cardiac arrest prediction, where it describes disparities that are not evident from simply comparing within-group predictive performance.
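The xAUC disparity described in the abstract can be estimated empirically by comparing every positive example in one group with every negative example in the other. The following is a minimal sketch, not the authors' implementation: the function names `xauc` and `xauc_disparity`, the argument layout, and the choice to count tied scores as half a correct ranking are all assumptions made for illustration.

```python
import numpy as np


def xauc(scores, labels, groups, pos_group, neg_group):
    """Estimate the probability that a random positive example from
    `pos_group` is scored above a random negative example from `neg_group`.

    Hypothetical helper for illustration; ties are counted as 0.5
    (an assumption, mirroring the usual empirical AUC convention).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    groups = np.asarray(groups)

    pos = scores[(labels == 1) & (groups == pos_group)]
    neg = scores[(labels == 0) & (groups == neg_group)]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one positive and one negative example")

    # Pairwise score differences: rows are positives, columns are negatives.
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


def xauc_disparity(scores, labels, groups, a, b):
    """Difference between the two cross-group ranking probabilities:
    positives of `a` vs. negatives of `b`, minus positives of `b`
    vs. negatives of `a`."""
    return (xauc(scores, labels, groups, a, b)
            - xauc(scores, labels, groups, b, a))
```

A disparity near zero means the two cross-group ranking probabilities are similar. A large positive or negative value indicates that one group's positives are ranked above the other group's negatives more reliably than the reverse, which is the kind of gap that within-group AUC comparisons can miss.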
