How Worst-Case Are Adversarial Attacks? Linking Adversarial and Statistical Robustness
This work examines whether adversarial attacks are valid proxies for robustness in safety-oriented evaluation, a question of practical relevance to machine learning practitioners. Its contribution is incremental, refining existing evaluation frameworks rather than replacing them.
The paper investigates whether adversarial attacks are representative of model robustness to random noise or instead reflect atypical worst-case events, and it introduces a probabilistic metric that quantifies noisy risk across perturbation distributions. Experiments on ImageNet and CIFAR-10 benchmark widely used attacks, showing when adversarial success correlates with noisy risk and when it does not, and report numerical results indicating the conditions under which attacks overestimate or underestimate robustness.
Adversarial attacks are widely used to evaluate model robustness, yet their validity as proxies for robustness to random perturbations remains debated. We ask whether an adversarial perturbation provides a representative estimate of robustness under random noise of the same magnitude, or instead reflects an atypical worst-case event. To this end, we introduce a probabilistic metric that quantifies noisy risk with respect to directionally biased perturbation distributions, parameterized by a concentration factor $\kappa$ that interpolates between isotropic noise and the adversarial direction. Using this framework, we study the limits of adversarial perturbations as estimators of noisy risk by proposing an attack strategy designed to operate in regimes statistically closer to uniform noise. Experiments on ImageNet and CIFAR-10 systematically benchmark widely used attacks, highlighting when adversarial success meaningfully reflects noisy risk and when it fails, thereby informing their use in safety-oriented evaluation.
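The idea of a noisy risk under a directionally biased distribution can be illustrated with a small Monte Carlo sketch. This is not the paper's implementation: the abstract does not specify the perturbation family, so the code assumes a simple one (a Gaussian sample shifted by $\kappa$ times a unit direction and rescaled to a fixed norm), which is uniform on the sphere at $\kappa = 0$ and concentrates on the chosen direction as $\kappa$ grows. The names `sample_biased_directions`, `noisy_risk`, and the toy linear classifier are all hypothetical, and the risk is taken here to be the probability that a sampled perturbation of norm `eps` changes the prediction.

```python
import numpy as np


def sample_biased_directions(u, kappa, n, rng):
    """Draw n unit vectors biased toward u (assumed family, not the paper's).

    kappa = 0 gives directions uniform on the sphere; large kappa
    concentrates the samples around u.
    """
    u = u / np.linalg.norm(u)
    g = rng.standard_normal((n, u.size))   # isotropic Gaussian part
    v = g + kappa * u                      # shift toward the chosen direction
    return v / np.linalg.norm(v, axis=1, keepdims=True)


def noisy_risk(predict, x, y, u, kappa, eps, n=2000, seed=0):
    """Monte Carlo estimate of P[predict(x + eps * d) != y] for d ~ p_kappa."""
    rng = np.random.default_rng(seed)
    d = sample_biased_directions(u, kappa, n, rng)
    return float(np.mean(predict(x + eps * d) != y))


# Hypothetical toy setup: a linear classifier in 50 dimensions.
dim = 50
w = np.zeros(dim)
w[0] = 1.0


def predict(batch):
    # Label 1 when the first coordinate is positive, else 0.
    return (batch @ w > 0).astype(int)


x = 0.5 * w      # clean input with margin 0.5, true label 1
y = 1
u_adv = -w       # worst-case direction for this linear model
eps = 1.0        # perturbation norm, larger than the margin

risk_iso = noisy_risk(predict, x, y, u_adv, kappa=0.0, eps=eps)
risk_adv = noisy_risk(predict, x, y, u_adv, kappa=100.0, eps=eps)
print(f"kappa=0: {risk_iso:.3f}, kappa=100: {risk_adv:.3f}")
```

In this toy case the adversarial direction always flips the label, while isotropic noise of the same norm almost never does, which is the kind of gap between adversarial success and noisy risk that the abstract asks about. Sweeping `kappa` between the two extremes traces how quickly the estimated risk rises as the distribution tilts toward the attack direction.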