A Spectral Perspective towards Understanding and Improving Adversarial Robustness
This work addresses adversarial robustness in deep learning, offering an incremental improvement to defense mechanisms for security-sensitive applications.
The paper tackles the vulnerability of deep neural networks to adversarial attacks by investigating adversarial training from a spectral perspective. It shows that adversarial training focuses on low-frequency regions to gain robustness, and it proposes a spectral alignment regularization that improves robust accuracy by 1.14% to 3.87% relative to standard adversarial training across multiple datasets and attacks.
Deep neural networks (DNNs) are highly vulnerable to carefully crafted, imperceptible adversarial perturbations. While adversarial training (AT) has proven to be an effective defense, the mechanism by which AT improves robustness is not fully understood. This work investigates AT from a spectral perspective, adding new insights for the design of effective defenses. In particular, we show that AT induces the deep model to focus more on the low-frequency region, which retains shape-biased representations, to gain robustness. Further, we find that the spectrum of a white-box attack is primarily distributed in the regions the model focuses on, and that the perturbation attacks the spectral bands where the model is vulnerable. Based on this observation, to train a model tolerant to frequency-varying perturbations, we propose a spectral alignment regularization (SAR) such that the spectral output inferred from an adversarial input stays as close as possible to that of its natural counterpart. Experiments demonstrate that SAR and its weight averaging (WA) extension significantly improve robust accuracy by 1.14% to 3.87% relative to standard AT, across multiple datasets (CIFAR-10, CIFAR-100 and Tiny ImageNet) and various attacks (PGD, C&W and AutoAttack), without any extra data.
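The two spectral ideas in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify how the "spectral output" is computed or which distance is used, so the sketch assumes a 1-D discrete Fourier transform over the model's output vector and a mean L1 distance. The function names `low_pass` and `spectral_alignment_penalty` and the `radius` parameter are hypothetical.

```python
import numpy as np


def low_pass(image, radius):
    """Keep only the frequencies within `radius` of the spectrum centre.

    Illustrates the "low-frequency region" of a 2-D grayscale image.
    The circular mask is an assumption made for this sketch.
    """
    h, w = image.shape
    # Move the zero-frequency (DC) component to index (h // 2, w // 2).
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    # Undo the shift and return to the pixel domain.
    return np.fft.ifft2(np.fft.ifftshift(spectrum * mask)).real


def spectral_alignment_penalty(out_nat, out_adv):
    """Mean L1 distance between the spectra of two batches of outputs.

    `out_nat` and `out_adv` have shape (batch, num_classes). Both the
    1-D FFT over the class axis and the L1 distance are assumptions.
    """
    spec_nat = np.fft.fft(out_nat, axis=-1)
    spec_adv = np.fft.fft(out_adv, axis=-1)
    return float(np.mean(np.abs(spec_nat - spec_adv)))


rng = np.random.default_rng(0)
image = rng.random((8, 8))
blurred = low_pass(image, radius=2)      # low-frequency content only

logits_nat = rng.normal(size=(4, 10))    # stand-in for outputs on natural inputs
logits_adv = logits_nat + 0.1 * rng.normal(size=(4, 10))  # stand-in for adversarial outputs
penalty = spectral_alignment_penalty(logits_nat, logits_adv)
```

In a training loop, a penalty of this kind would presumably be added to the adversarial training loss with a weighting coefficient; that coefficient and the exact form of the loss are not given in the abstract.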