Two Coupled Rejection Metrics Can Tell Adversarial Examples Apart
This work addresses the challenge of safely deploying ML models by providing a complementary method to detect and reject adversarial examples. The contribution is incremental, since it builds on existing adversarial training frameworks.
The paper aims to improve the robustness of machine learning models against adversarial examples by introducing a rejection option based on two coupled metrics, confidence and rectified confidence (R-Con), which can provably distinguish wrongly classified inputs from correctly classified ones. It demonstrates that the resulting rectified rejection (RR) module improves robustness on CIFAR-10, CIFAR-10-C, and CIFAR-100 under several attacks, including adaptive ones, with little extra computation.
Correctly classifying adversarial examples is an essential but challenging requirement for safely deploying machine learning models. As reported in RobustBench, even state-of-the-art adversarially trained models struggle to exceed 67% robust test accuracy on CIFAR-10, which is far from practical. A complementary route to robustness is to introduce a rejection option, which allows the model to withhold predictions on uncertain inputs, with confidence commonly used as a proxy for certainty. Following this approach, we find that confidence and a rectified confidence (R-Con) can form two coupled rejection metrics, which can provably distinguish wrongly classified inputs from correctly classified ones. This intriguing property sheds light on using coupling strategies to better detect and reject adversarial examples. We evaluate our rectified rejection (RR) module on CIFAR-10, CIFAR-10-C, and CIFAR-100 under several attacks, including adaptive ones, and demonstrate that the RR module is compatible with different adversarial training frameworks in improving robustness, with little extra computation. The code is available at https://github.com/P2333/Rectified-Rejection.
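To make the idea of a rejection option with two coupled metrics concrete, the following is a minimal sketch and not the authors' implementation. It assumes that confidence is the maximum softmax probability and that R-Con is obtained by scaling confidence with an auxiliary score in [0, 1]. The function `predict_or_reject`, the `aux_score` argument, and both thresholds are hypothetical names and values chosen only for illustration.

```python
import math
from typing import List, Optional, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_or_reject(
    logits: Sequence[float],
    aux_score: float,
    con_threshold: float = 0.5,   # hypothetical threshold on confidence
    rcon_threshold: float = 0.5,  # hypothetical threshold on R-Con
) -> Optional[int]:
    """Return the predicted label, or None if the input is rejected.

    aux_score stands in for the output of an auxiliary head and is
    assumed to lie in [0, 1]; this is an illustrative assumption.
    """
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[label]

    # Assumption: rectified confidence = confidence scaled by aux_score.
    r_con = confidence * aux_score

    # Coupled rule: accept only when both metrics clear their thresholds.
    if confidence >= con_threshold and r_con >= rcon_threshold:
        return label
    return None
```

For example, `predict_or_reject([4.0, 0.0, 0.0], aux_score=0.9)` returns label `0`, while the same logits with `aux_score=0.3` are rejected because the rectified score falls below its threshold even though the plain confidence is high. In this sketch, that is the sense in which the two metrics are coupled: neither one alone decides acceptance.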