LG CY IT ML · Feb 12, 2020

To Split or Not to Split: The Impact of Disparate Treatment in Classification

Hao Wang, Hsiang Hsu, Mario Diaz, Flavio P. Calmon

arXiv:2002.04788v4 · 12.026 citations

Originality: Incremental advance

AI Analysis

This paper addresses fairness and accuracy trade-offs in machine learning for domains where prediction accuracy is critical. The advance is incremental because it builds on existing fairness frameworks.

The paper tackles the problem of disparate treatment in classification by comparing split and group-blind classifiers. It introduces the benefit-of-splitting metric, proves that the metric can be computed efficiently by solving small-scale convex programs, derives sharp upper and lower bounds on it, and validates the results on synthetic and real-world datasets.

Disparate treatment occurs when a machine learning model yields different decisions for individuals based on a sensitive attribute (e.g., age, sex). In domains where prediction accuracy is paramount, it could potentially be acceptable to fit a model which exhibits disparate treatment. To evaluate the effect of disparate treatment, we compare the performance of split classifiers (i.e., classifiers trained and deployed separately on each group) with group-blind classifiers (i.e., classifiers which do not use a sensitive attribute). We introduce the benefit-of-splitting for quantifying the performance improvement by splitting classifiers. Computing the benefit-of-splitting directly from its definition could be intractable since it involves solving optimization problems over an infinite-dimensional functional space. Under different performance measures, we (i) prove an equivalent expression for the benefit-of-splitting which can be efficiently computed by solving small-scale convex programs; (ii) provide sharp upper and lower bounds for the benefit-of-splitting which reveal precise conditions where a group-blind classifier will always suffer from a non-trivial performance gap from the split classifiers. In the finite sample regime, splitting is not necessarily beneficial and we provide data-dependent bounds to understand this effect. Finally, we validate our theoretical results through numerical experiments on both synthetic and real-world datasets.
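To make the comparison between split and group-blind classifiers concrete, the following is a minimal sketch for a toy setting with a finite feature set and known distributions. It rests on several assumptions that the abstract does not state: performance is measured by worst-group 0-1 error, classifiers are deterministic, and the gap is taken as the best group-blind worst-group error minus the best split worst-group error. The function names `group_error` and `benefit_of_splitting` are hypothetical, and the brute-force search stands in for the small-scale convex programs mentioned above; it is not the paper's method.

```python
from itertools import product


def group_error(h, px, eta):
    """0-1 risk of a deterministic classifier on one group.

    h   : tuple of 0/1 labels, one per feature value
    px  : P(X = x | group) for each feature value
    eta : P(Y = 1 | X = x, group) for each feature value
    """
    # Predicting 1 is wrong with probability 1 - eta; predicting 0 with probability eta.
    return sum(p * ((1 - e) if label == 1 else e)
               for label, p, e in zip(h, px, eta))


def benefit_of_splitting(px_by_group, eta_by_group):
    """Illustrative gap in worst-group error (an assumed formalization)."""
    groups = list(zip(px_by_group, eta_by_group))
    n_values = len(px_by_group[0])

    # Split classifiers: each group uses its own Bayes-optimal rule.
    split_worst = max(
        sum(p * min(e, 1 - e) for p, e in zip(px, eta))
        for px, eta in groups
    )

    # Group-blind classifier: one shared rule for all groups,
    # found by brute force over all 2^n deterministic labelings.
    blind_worst = min(
        max(group_error(h, px, eta) for px, eta in groups)
        for h in product((0, 1), repeat=n_values)
    )
    return blind_worst - split_worst


# Two feature values, uniform in both groups, with opposite labeling rules.
px = [[0.5, 0.5], [0.5, 0.5]]
eta_opposed = [[0.0, 1.0], [1.0, 0.0]]
print(benefit_of_splitting(px, eta_opposed))  # 0.5
```

In this toy case each group can be classified perfectly on its own, but any shared rule is wrong half the time on at least one group, so the gap is 0.5. When both groups share the same labeling rule, the gap is zero. The sketch works with population quantities only, so it does not capture the finite-sample effect described in the abstract, where splitting leaves each classifier with fewer training samples.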
