Training Ensembles to Detect Adversarial Examples
This work addresses the vulnerability of neural networks to adversarial examples, a critical issue for deploying AI in safety-sensitive applications. The contribution appears incremental, since it builds on existing ensemble and detection techniques.
The paper tackles the detection of adversarial examples by proposing an ensemble training method that keeps classification error low on benign data while minimizing agreement among ensemble members on out-of-distribution examples. It reports improved detection rates against several attacks, including DeepFool and C&W, on MNIST and CIFAR-10.
We propose a new ensemble method for detecting and classifying adversarial examples generated by state-of-the-art attacks, including DeepFool and C&W. Our method works by training the members of an ensemble to have low classification error on random benign examples while simultaneously minimizing agreement on examples outside the training distribution. We evaluate on both MNIST and CIFAR-10, against oblivious and both white- and black-box adversaries.
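The training objective described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the weight `lam`, the pairwise probability-overlap measure of agreement, and the voting threshold `min_agreement` are all assumptions chosen for illustration. The sketch combines a cross-entropy term on benign examples with a penalty that grows when ensemble members assign similar predictions to out-of-distribution inputs, and then flags inputs on which the members' votes disagree.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last (class) axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def classification_loss(benign_logits, labels):
    # benign_logits: (members, examples, classes); labels: (examples,)
    # Mean cross-entropy over all members and benign examples.
    probs = softmax(benign_logits)
    n = labels.shape[0]
    picked = probs[:, np.arange(n), labels]  # (members, examples)
    return float(-np.log(picked + 1e-12).mean())

def agreement_penalty(ood_logits):
    # ood_logits: (members, examples, classes), with at least 2 members.
    # Assumed agreement measure: overlap sum_c p_i(c) * p_j(c), averaged
    # over all pairs of distinct members and all OOD examples.
    probs = softmax(ood_logits)
    m = probs.shape[0]
    overlap = np.einsum("inc,jnc->ijn", probs, probs)  # (members, members, examples)
    off_diagonal = ~np.eye(m, dtype=bool)
    return float(overlap[off_diagonal].mean())

def ensemble_loss(benign_logits, labels, ood_logits, lam=1.0):
    # Low error on benign data plus low agreement on OOD data.
    # lam is a hypothetical trade-off weight.
    return classification_loss(benign_logits, labels) + lam * agreement_penalty(ood_logits)

def flag_suspicious(logits, min_agreement=1.0):
    # Flag an input when the fraction of members voting for the
    # plurality class falls below min_agreement (hypothetical rule).
    votes = logits.argmax(axis=-1)  # (members, examples)
    m = votes.shape[0]
    num_classes = logits.shape[-1]
    counts = np.stack([(votes == c).sum(axis=0) for c in range(num_classes)])
    return counts.max(axis=0) / m < min_agreement
```

With two members and two classes, logits such as `[[[10., -10.]], [[-10., 10.]]]` give a penalty close to 0 and are flagged, whereas identical confident logits give a penalty close to 1 and are not flagged. In a real training loop the same two terms would be written in an automatic-differentiation framework so that gradients flow to every member.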