CL, LG | Oct 22, 2022

ADDMU: Detection of Far-Boundary Adversarial Examples with Data and Model Uncertainty Estimation

Fan Yin, Yao Li, Cho-Jui Hsieh, Kai-Wei Chang

arXiv:2210.12396v124.0293 citations | h-index: 84 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for more robust adversarial example detection in NLP. It is an incremental advance because it builds on existing uncertainty estimation methods.

The paper tackles the problem of detecting adversarial examples in NLP. It identifies a shortcut in existing methods, which rely on adversarial examples lying near the decision boundary, and proposes ADDMU, a technique based on data and model uncertainty estimation. ADDMU outperforms previous methods by 3.6 and 6.0 AUC points in the regular and far-boundary scenarios, respectively.

Adversarial Examples Detection (AED) is a crucial defense technique against adversarial attacks and has drawn increasing attention from the Natural Language Processing (NLP) community. Despite the surge of new AED methods, our studies show that existing methods heavily rely on a shortcut to achieve good performance. Specifically, current search-based adversarial attacks in NLP stop once model predictions change, and thus most adversarial examples generated by those attacks are located near model decision boundaries. To bypass this shortcut and fairly evaluate AED methods, we propose to test AED methods with Far Boundary (FB) adversarial examples. Existing methods perform worse than random guessing under this scenario. To overcome this limitation, we propose a new technique, ADDMU (adversary detection with data and model uncertainty), which combines two types of uncertainty estimation for both regular and FB adversarial example detection. Our new method outperforms previous methods by 3.6 and 6.0 AUC points under the regular and FB scenarios, respectively. Finally, our analysis shows that the two types of uncertainty provided by ADDMU can be leveraged to characterize adversarial examples and identify the ones that contribute most to the model's robustness in adversarial training.
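The abstract does not say how the two uncertainties are estimated, so the following is only a minimal sketch of the general idea and not the authors' implementation. It assumes a toy logistic classifier, estimates data uncertainty as the spread of predictions under random input noise, and estimates model uncertainty as the spread of predictions under random weight dropout. All function names, parameters, and noise settings are hypothetical.

```python
import math
import random
import statistics


def predict_proba(weights, x):
    """Toy binary classifier (assumption): sigmoid of a dot product."""
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def data_uncertainty(weights, x, n_samples=200, noise=0.5, seed=0):
    """Std. dev. of the predicted probability when the INPUT is perturbed."""
    rng = random.Random(seed)
    probs = [
        predict_proba(weights, [xi + rng.gauss(0.0, noise) for xi in x])
        for _ in range(n_samples)
    ]
    return statistics.pstdev(probs)


def model_uncertainty(weights, x, n_samples=200, drop_rate=0.3, seed=0):
    """Std. dev. of the predicted probability when WEIGHTS are randomly dropped."""
    rng = random.Random(seed)
    keep = 1.0 - drop_rate
    probs = []
    for _ in range(n_samples):
        # Inverted dropout: zero a weight with probability drop_rate,
        # rescale the survivors so the expected weight is unchanged.
        masked = [w / keep if rng.random() < keep else 0.0 for w in weights]
        probs.append(predict_proba(masked, x))
    return statistics.pstdev(probs)


def uncertainty_features(weights, x):
    """Pair of scores a downstream detector could threshold or combine."""
    return data_uncertainty(weights, x), model_uncertainty(weights, x)


if __name__ == "__main__":
    toy_weights = [1.0, -1.0]
    print(uncertainty_features(toy_weights, [0.1, 0.0]))
```

A detector built on this idea would compare the two scores of a test input against scores collected from clean data. How ADDMU actually combines them is described in the paper, not here.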
