Multi-Group Fairness Evaluation via Conditional Value-at-Risk Testing
This work addresses the challenge of efficiently assessing fairness in ML models when groups are defined by many sensitive attributes, which is crucial for applications such as hiring or lending. The contribution is incremental in that it builds on existing fairness evaluation methods.
The paper tackles the problem of evaluating performance disparities of machine learning models across groups defined by multiple sensitive attributes, where sample complexity grows exponentially with the number of attributes. It proposes a Conditional Value-at-Risk (CVaR) testing approach that reduces sample complexity to at most the square root of the number of groups by allowing small probabilistic slack, and also identifies a non-i.i.d. data collection strategy that achieves sample complexity independent of the number of groups.
Machine learning (ML) models used in prediction and classification tasks may display performance disparities across population groups determined by sensitive attributes (e.g., race, sex, age). We consider the problem of evaluating the performance of a fixed ML model across population groups defined by multiple sensitive attributes (e.g., race and sex and age). Here, the sample complexity for estimating the worst-case performance gap across groups (e.g., the largest difference in error rates) increases exponentially with the number of group-denoting sensitive attributes. To address this issue, we propose an approach to test for performance disparities based on Conditional Value-at-Risk (CVaR). By allowing a small probabilistic slack on the groups over which a model has approximately equal performance, we show that the sample complexity required for discovering performance violations is reduced exponentially, to at most the square root of the number of groups. As a byproduct of our analysis, when the groups are weighted by a specific prior distribution, we show that the Rényi entropy of order 2/3 of the prior distribution captures the sample complexity of the proposed CVaR test algorithm. Finally, we also show that there exists a non-i.i.d. data collection strategy that results in a sample complexity independent of the number of groups.
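To make the quantities in the abstract concrete, the following is a minimal sketch, not the paper's test algorithm. It assumes the per-group performance gaps are already known (the paper's setting estimates them from samples), and it uses the standard definitions of CVaR for a discrete distribution and of Rényi entropy. The attribute cardinalities, the gap values, the uniform prior, and the convention that `alpha` is the confidence level (so `1 - alpha` is the tail mass) are all illustrative assumptions.

```python
import math
from itertools import product


def cvar(values, weights, alpha):
    """CVaR at level alpha: mean of the worst (1 - alpha) probability mass."""
    tail = 1.0 - alpha
    remaining = tail
    total = 0.0
    # Walk through the groups from the largest gap to the smallest,
    # taking probability mass until the tail budget is used up.
    for value, weight in sorted(zip(values, weights), reverse=True):
        take = min(weight, remaining)
        total += take * value
        remaining -= take
        if remaining <= 1e-12:
            break
    return total / tail


def renyi_entropy(prior, order):
    """Renyi entropy (in nats) of a discrete distribution, for order != 1."""
    return math.log(sum(p ** order for p in prior if p > 0)) / (1.0 - order)


# Hypothetical sensitive attributes with the number of values each can take.
attributes = {"race": 2, "sex": 2, "age": 2}

# Groups are all combinations of attribute values, so their number is the
# product of the cardinalities and grows exponentially with the attributes.
groups = list(product(*(range(k) for k in attributes.values())))
num_groups = len(groups)

# Made-up per-group gaps (e.g., |group error rate - overall error rate|).
gaps = [0.00, 0.01, 0.02, 0.01, 0.03, 0.02, 0.01, 0.30]
prior = [1.0 / num_groups] * num_groups  # assumed uniform group weights

worst_gap = max(gaps)                    # worst-case gap over all groups
cvar_gap = cvar(gaps, prior, alpha=0.75) # mean gap over the worst 25% of mass
entropy = renyi_entropy(prior, 2.0 / 3.0)

print(num_groups, worst_gap, cvar_gap, entropy)
```

With `alpha = 0` the CVaR reduces to the prior-weighted mean gap, and as `alpha` approaches 1 it approaches the worst-case gap, which is the sense in which a small slack relaxes the worst-case criterion. For a uniform prior over K groups, the Rényi entropy of any order equals log K.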