Are Perceptually-Aligned Gradients a General Property of Robust Classifiers?
This paper addresses the link between adversarial robustness and interpretability in machine learning. The contribution is incremental: it extends prior findings on adversarially-trained networks to a second robustness method.
The paper investigates whether perceptually-aligned gradients, where input optimization produces images resembling target classes, are a general property of robust classifiers, showing they also occur under randomized smoothing, an alternative adversarial robustness method.
For a standard convolutional neural network, optimizing over the input pixels to maximize the score of some target class will generally produce a grainy-looking version of the original image. However, Santurkar et al. (2019) demonstrated that for adversarially-trained neural networks, this optimization produces images that uncannily resemble the target class. In this paper, we show that these "perceptually-aligned gradients" also occur under randomized smoothing, an alternative means of constructing adversarially-robust classifiers. Our finding supports the hypothesis that perceptually-aligned gradients may be a general property of robust classifiers. We hope that our results will inspire research aimed at explaining this link between perceptually-aligned gradients and adversarial robustness.
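The input optimization described above can be illustrated with a minimal sketch. This is not the authors' procedure. It assumes a toy linear-softmax classifier in place of a convolutional network, and it approximates the smoothed classifier by averaging softmax outputs over Gaussian-perturbed copies of the input, a differentiable stand-in for randomized smoothing. All names (`class_probs`, `smoothed_prob_grad`, `maximize_target`) and all hyperparameters (`sigma`, `n_samples`, `step`, `n_steps`) are hypothetical, as are the normalized gradient step and the clipping of pixels to [0, 1].

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def class_probs(W, b, x):
    """Class probabilities of a toy linear-softmax classifier (assumption)."""
    logits = [sum(w * v for w, v in zip(row, x)) + bk for row, bk in zip(W, b)]
    return softmax(logits)


def target_prob_grad(W, b, x, target):
    """Return p_target(x) and its gradient with respect to the input x."""
    p = class_probs(W, b, x)
    # For a linear-softmax model: d p_t / d x = p_t * (W_t - sum_k p_k W_k).
    mean_row = [sum(p[k] * W[k][i] for k in range(len(W))) for i in range(len(x))]
    grad = [p[target] * (W[target][i] - mean_row[i]) for i in range(len(x))]
    return p[target], grad


def smoothed_prob_grad(W, b, x, target, sigma, n_samples, rng):
    """Monte Carlo estimate of E[p_target(x + noise)] and its input gradient,
    with noise drawn from an isotropic Gaussian of standard deviation sigma."""
    total_p = 0.0
    total_g = [0.0] * len(x)
    for _ in range(n_samples):
        noisy = [v + rng.gauss(0.0, sigma) for v in x]
        p_t, g = target_prob_grad(W, b, noisy, target)
        total_p += p_t
        total_g = [a + c for a, c in zip(total_g, g)]
    return total_p / n_samples, [g / n_samples for g in total_g]


def maximize_target(W, b, x0, target, sigma=0.25, n_samples=32,
                    step=0.1, n_steps=50, seed=0):
    """Gradient ascent on the input to raise the smoothed target-class score."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(n_steps):
        _, g = smoothed_prob_grad(W, b, x, target, sigma, n_samples, rng)
        # Normalized step (an assumption), then clip to the pixel range [0, 1].
        norm = math.sqrt(sum(v * v for v in g)) or 1.0
        x = [min(1.0, max(0.0, v + step * gv / norm)) for v, gv in zip(x, g)]
    return x


# Usage: two classes, three "pixels"; push a gray input toward class 1.
W = [[1.0, -1.0, 0.0], [-1.0, 1.0, 0.0]]
b = [0.0, 0.0]
x0 = [0.5, 0.5, 0.5]
x_opt = maximize_target(W, b, x0, target=1)
print(class_probs(W, b, x0)[1], class_probs(W, b, x_opt)[1])
```

For a real network, the analytic gradient in `target_prob_grad` would be replaced by automatic differentiation through the model. The averaging over noisy copies is the only part specific to smoothing; dropping it (setting `n_samples=1`, `sigma=0`) recovers plain input optimization on the base classifier.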