Breaking Fair Binary Classification with Optimal Flipping Attacks
This work addresses a security risk that matters to practitioners of fair machine learning, although its contribution is incremental, since it builds on prior studies of how data corruption affects fair learning.
The paper studies the vulnerability of fair binary classification to data poisoning. It determines the minimum amount of corruption needed for a successful flipping attack, derives lower and upper bounds on that quantity that are tight when the target model is the unique unconstrained risk minimizer, and proposes a computationally efficient attack algorithm.
Minimizing risk with fairness constraints is one of the popular approaches to learning a fair classifier. Recent work has shown that this approach yields an unfair classifier if the training set is corrupted. In this work, we study the minimum amount of data corruption required for a successful flipping attack. First, we find lower and upper bounds on this quantity and show that these bounds are tight when the target model is the unique unconstrained risk minimizer. Second, we propose a computationally efficient data poisoning attack algorithm that can compromise the performance of fair learning algorithms.
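To make the quantities in the abstract concrete, the following is a minimal sketch of the bookkeeping behind a flipping attack. It is not the paper's attack algorithm and does not reproduce its bounds. It assumes that "flipping" means inverting binary labels (the text above does not say which fields are flipped), that "amount of corruption" is the fraction of changed training points, and that unfairness is measured by the demographic parity gap. The function names are hypothetical.

```python
import numpy as np


def flip_labels(y, indices):
    """Return a copy of binary labels `y` with the entries at `indices` flipped.

    Assumption: labels are 0/1 and the attacker only changes labels.
    """
    y_poisoned = np.array(y, copy=True)
    y_poisoned[indices] = 1 - y_poisoned[indices]
    return y_poisoned


def corruption_fraction(y_clean, y_poisoned):
    """Fraction of training points whose label differs after the attack."""
    return float(np.mean(np.asarray(y_clean) != np.asarray(y_poisoned)))


def demographic_parity_gap(y_pred, group):
    """Absolute difference in positive-prediction rates between groups 0 and 1.

    Assumption: this is one common fairness measure, chosen for illustration.
    """
    y_pred = np.asarray(y_pred)
    group = np.asarray(group)
    rate_0 = y_pred[group == 0].mean()
    rate_1 = y_pred[group == 1].mean()
    return float(abs(rate_0 - rate_1))


# Toy usage: flip two of five labels, then report how much was corrupted.
y_clean = np.array([0, 1, 1, 0, 1])
y_poisoned = flip_labels(y_clean, [0, 2])
print(corruption_fraction(y_clean, y_poisoned))
```

In this framing, the minimum corruption studied in the paper would be the smallest value of `corruption_fraction` for which a learner trained on the poisoned data returns the attacker's target model. Finding that minimum is the hard part and is not attempted here.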