Characterizing the Optimal 0-1 Loss for Multi-class Classification with a Test-time Attacker
This provides a diagnostic tool for evaluating the gap to optimal robustness in multi-class adversarial settings, which is critical for safe deployment of classifiers.
The paper tackles the problem of determining the best possible robustness for multi-class classifiers against test-time attackers by finding information-theoretic lower bounds on 0-1 loss for any discrete dataset, using a conflict hypergraph framework and variants to efficiently compute optimal loss ranges.
Finding classifiers robust to adversarial examples is critical for their safe deployment. Determining the robustness of the best possible classifier under a given threat model for a given data distribution and comparing it to that achieved by state-of-the-art training methods is thus an important diagnostic tool. In this paper, we find achievable information-theoretic lower bounds on loss in the presence of a test-time attacker for multi-class classifiers on any discrete dataset. We provide a general framework for finding the optimal 0-1 loss that revolves around the construction of a conflict hypergraph from the data and adversarial constraints. We further define other variants of the attacker-classifier game that determine the range of the optimal loss more efficiently than the full-fledged hypergraph construction. Our evaluation shows, for the first time, an analysis of the gap to optimal robustness for classifiers in the multi-class setting on benchmark datasets.