CL | Feb 27, 2024

Extreme Miscalibration and the Illusion of Adversarial Robustness

arXiv:2402.17509v3 | 29 citations | h-index: 61 | ACL
Originality: Incremental advance
AI Analysis

This work addresses a critical issue for the NLP community by exposing false robustness claims and providing a method to ensure genuine evaluations, though it is incremental in improving robustness techniques.

The paper examines misleading robustness gains in NLP models that stem from miscalibration. It shows that the apparent adversarial robustness is an illusion: once the adversary applies test-time temperature calibration, the attack can again find adversarial examples.

Deep learning-based Natural Language Processing (NLP) models are vulnerable to adversarial attacks, where small perturbations can cause a model to misclassify. Adversarial Training (AT) is often used to increase model robustness. However, we have discovered an intriguing phenomenon: deliberately or accidentally miscalibrating models masks gradients in a way that interferes with adversarial attack search methods, giving rise to an apparent increase in robustness. We show that this observed gain in robustness is an illusion of robustness (IOR), and demonstrate how an adversary can perform various forms of test-time temperature calibration to nullify the aforementioned interference and allow the adversarial attack to find adversarial examples. Hence, we urge the NLP community to incorporate test-time temperature scaling into their robustness evaluations to ensure that any observed gains are genuine. Finally, we show how the temperature can be scaled during training to improve genuine robustness.
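
The masking effect described in the abstract can be illustrated with a small sketch. This is not the authors' code: the logits, the temperature value of 500, and the helper names `softmax` and `ce_grad` are hypothetical choices made only for illustration. It assumes a classifier whose logits are so large that the softmax saturates, and shows that dividing the logits by a temperature T at test time restores a usable attack signal without changing the predicted class.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ce_grad(logits, label, temperature=1.0):
    """Cross-entropy gradient w.r.t. the temperature-scaled logits: p - onehot."""
    probs = softmax(logits, temperature)
    return [p - (1.0 if i == label else 0.0) for i, p in enumerate(probs)]


# Hypothetical, extremely overconfident logits for a 3-class model.
logits = [900.0, 100.0, -300.0]
label = 0

# Without calibration (T = 1) the softmax saturates and the gradient vanishes.
raw_grad = ce_grad(logits, label)
raw_signal = max(abs(g) for g in raw_grad)

# With an attacker-chosen temperature (assumed value) the gradient is informative again.
cal_grad = ce_grad(logits, label, temperature=500.0)
cal_signal = max(abs(g) for g in cal_grad)

# Temperature scaling keeps the ranking of the logits, so the prediction is unchanged.
raw_probs = softmax(logits)
cal_probs = softmax(logits, temperature=500.0)
raw_pred = raw_probs.index(max(raw_probs))
cal_pred = cal_probs.index(max(cal_probs))

print(raw_signal, cal_signal, raw_pred == cal_pred)
```

In this toy setting the saturated model gives the attacker nothing to follow, while the rescaled logits do. The abstract refers to various forms of test-time temperature calibration; how the temperature is actually selected is not specified here, so the fixed value above is only a placeholder.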
