Interpretability-Guided Test-Time Adversarial Defense
This work offers machine learning practitioners an efficient and effective adversarial defense, though the contribution is incremental because it builds on existing test-time defense approaches.
The paper tackles adversarial attacks on neural networks by proposing a low-cost, training-free test-time defense that uses interpretability-guided neuron importance ranking to improve the robustness-accuracy tradeoff with minimal computational overhead. On the RobustBench benchmark it reports average gains of 2.6%, 4.9%, and 2.8% on CIFAR10, CIFAR100, and ImageNet-1k respectively, and an average improvement of 1.5% over state-of-the-art test-time defenses under strong adaptive attacks.
We propose a novel and low-cost test-time adversarial defense by devising interpretability-guided neuron importance ranking methods to identify neurons important to the output classes. Our method is a training-free approach that can significantly improve the robustness-accuracy tradeoff while incurring minimal computational overhead. While being among the most efficient test-time defenses (4x faster), our method is also robust to a wide range of black-box, white-box, and adaptive attacks that break previous test-time defenses. We demonstrate the efficacy of our method for CIFAR10, CIFAR100, and ImageNet-1k on the standard RobustBench benchmark (with average gains of 2.6%, 4.9%, and 2.8% respectively). We also show improvements (average 1.5%) over the state-of-the-art test-time defenses even under strong adaptive attacks.
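The abstract does not spell out how neurons are ranked or how the ranking is used at inference, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a hypothetical scoring rule (final-layer weight times the mean activation per class) and a hypothetical masking step that keeps only the top-ranked neurons for the initially predicted class. All function names, the scoring rule, and the toy numbers are illustrative assumptions.

```python
from typing import List, Sequence


def rank_neurons_per_class(
    weights: Sequence[Sequence[float]],
    mean_acts: Sequence[Sequence[float]],
    top_k: int,
) -> List[List[int]]:
    """Return, for each class, the indices of its top_k neurons.

    ASSUMPTION: importance of neuron j for class c is scored as
    weights[c][j] * mean_acts[c][j]. This is an illustrative rule,
    not the ranking method described in the paper.
    """
    ranking = []
    for w_c, a_c in zip(weights, mean_acts):
        scores = [w * a for w, a in zip(w_c, a_c)]
        order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        ranking.append(order[:top_k])
    return ranking


def logits(weights: Sequence[Sequence[float]], acts: Sequence[float]) -> List[float]:
    """Linear classifier head: one dot product per class."""
    return [sum(w * a for w, a in zip(w_c, acts)) for w_c in weights]


def argmax(xs: Sequence[float]) -> int:
    return max(range(len(xs)), key=lambda i: xs[i])


def masked_predict(
    weights: Sequence[Sequence[float]],
    acts: Sequence[float],
    ranking: Sequence[Sequence[int]],
) -> int:
    """Training-free test-time step (hypothetical).

    1. Predict a class from the unmodified activations.
    2. Zero every neuron outside that class's top-ranked set.
    3. Re-classify from the masked activations.
    """
    first_guess = argmax(logits(weights, acts))
    keep = set(ranking[first_guess])
    masked = [a if j in keep else 0.0 for j, a in enumerate(acts)]
    return argmax(logits(weights, masked))


# Toy data (made up): 2 classes, 4 penultimate-layer neurons.
weights = [[1.0, 0.5, -0.2, 0.0], [-0.3, 0.1, 0.9, 0.7]]
mean_acts = [[2.0, 1.0, 0.1, 0.2], [0.2, 0.3, 1.5, 1.0]]

ranking = rank_neurons_per_class(weights, mean_acts, top_k=2)
prediction = masked_predict(weights, [1.8, 0.9, 0.4, 0.3], ranking)
```

The ranking is computed once, offline, and the inference step adds only a mask and a second pass through the classifier head. This is consistent with the low overhead the abstract claims, though the real method may differ in both the scoring and the masking.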