Fairer and more accurate, but for whom?
This addresses the need for more nuanced model evaluation in domains like criminal justice and finance to ensure fairness, though it is incremental as it builds on existing fairness and comparison methods.
The paper tackles the problem of comparing machine learning models for high-stakes decisions by introducing a framework to identify subgroups where models differ most in fairness metrics, such as racial or gender disparities, rather than relying on overall performance metrics. It demonstrates this with experimental results from recidivism prediction and lending tasks.
Complex statistical machine learning models are increasingly being used or considered for use in high-stakes decision-making pipelines in domains such as financial services, health care, criminal justice and human services. These models are often investigated as possible improvements over more classical tools such as regression models or human judgement. While the modeling approach may be new, the practice of using some form of risk assessment to inform decisions is not. When determining whether a new model should be adopted, it is therefore essential to be able to compare the proposed model to the existing approach across a range of task-relevant accuracy and fairness metrics. Looking at overall performance metrics, however, may be misleading. Even when two models have comparable overall performance, they may nevertheless disagree in their classifications on a considerable fraction of cases. In this paper we introduce a model comparison framework for automatically identifying subgroups in which the differences between models are most pronounced. Our primary focus is on identifying subgroups where the models differ in terms of fairness-related quantities such as racial or gender disparities. We present experimental results from a recidivism prediction task and a hypothetical lending example.