Robust Adversarial Classification via Abstaining
This paper addresses adversarial robustness for classification systems, but the contribution is incremental because it builds on existing hypothesis testing frameworks.
The paper tackles the problem of adversarial robustness in binary classification by introducing an abstain option for low-confidence predictions, and shows that there is a fundamental tradeoff between nominal performance and robustness, with validation on the MNIST dataset.
In this work, we consider a binary classification problem and cast it into a binary hypothesis testing framework, where the observations can be perturbed by an adversary. To improve the adversarial robustness of a classifier, we include an abstain option, where the classifier abstains from making a decision when it has low confidence about the prediction. We propose metrics to quantify the nominal performance of a classifier with an abstain option and its robustness against adversarial perturbations. We show that there exists a tradeoff between the two metrics regardless of what method is used to choose the abstain region. Our results imply that the robustness of a classifier with an abstain option can only be improved at the expense of its nominal performance. Further, we provide necessary conditions to design the abstain region for a 1-dimensional binary classification problem. We validate our theoretical results on the MNIST dataset, where we numerically show that the tradeoff between performance and robustness also exists for general multi-class classification problems.
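The tradeoff described above can be illustrated with a minimal sketch of a 1-dimensional test. Everything in it is an assumption made for illustration and is not taken from the paper: the two hypotheses are unit-variance Gaussians centered at -mu and +mu, the abstain region is the symmetric interval [-half_width, half_width], the adversary may shift an observation by at most eps, and the two quantities computed (probability of a correct decision without attack, and probability of a wrong decision under attack) are simple stand-ins for the paper's metrics, whose exact definitions are not given here. The names normal_cdf and abstain_metrics are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def abstain_metrics(half_width, mu=1.0, eps=0.5):
    """Illustrative metrics for a symmetric abstain region (assumed setup).

    Under H1 the observation is x ~ N(mu, 1). The classifier decides 1 if
    x > half_width, decides 0 if x < -half_width, and abstains otherwise.
    By symmetry the same values hold under H0.
    """
    # Nominal performance: probability of a correct decision with no attack,
    # i.e. P(x > half_width | H1). Abstaining does not count as correct.
    nominal_correct = 1.0 - normal_cdf(half_width - mu)
    # Robustness proxy: probability that a shift of at most eps pushes x into
    # the wrong decision region, i.e. P(x - eps < -half_width | H1).
    adversarial_error = normal_cdf(eps - half_width - mu)
    return nominal_correct, adversarial_error


if __name__ == "__main__":
    for half_width in (0.0, 0.5, 1.0, 1.5):
        correct, error = abstain_metrics(half_width)
        print(f"half_width={half_width:.1f}  "
              f"nominal_correct={correct:.3f}  adversarial_error={error:.3f}")
```

In this assumed setup, widening the abstain region lowers the adversarial error and the nominal probability of a correct decision at the same time, which is the qualitative behavior the abstract attributes to any choice of abstain region.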