CL · Feb 26, 2024

Unveiling Vulnerability of Self-Attention

arXiv:2402.16470v181 citations · h-index: 9 · Has Code · LREC

Originality: Incremental advance

AI Analysis

This work addresses security risks in real-world NLP systems by exposing and mitigating vulnerabilities in the self-attention mechanism of transformer-based models, representing an incremental advance in adversarial robustness.

The paper tackles the vulnerability of pre-trained language models to minor structural perturbations in self-attention mechanisms, showing that a 1% perturbation can achieve a 98% attack success rate, and proposes a smoothing technique that matches adversarial training in robustness.

Pre-trained language models (PLMs) have been shown to be vulnerable to minor word changes, which poses a serious threat to real-world systems. Previous studies focus directly on manipulating word inputs, but they are limited by how they generate adversarial samples and generalize poorly to the variety of real-world attacks. This paper studies the basic structure of transformer-based PLMs, the self-attention (SA) mechanism. (1) We propose a powerful perturbation technique, HackAttend, which perturbs the attention scores within the SA matrices via meticulously crafted attention masks. We show that state-of-the-art PLMs are highly vulnerable: minor attention perturbations (1%) can produce a very high attack success rate (98%). Our paper extends the conventional text attack of word perturbations to more general structural perturbations. (2) We introduce S-Attend, a novel smoothing technique that effectively makes SA robust via structural perturbations. We empirically demonstrate that this simple yet effective technique achieves robust performance on par with adversarial training when facing various text attackers. Code is publicly available at github.com/liongkj/HackAttend.
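
To make "perturbing attention scores via attention masks" concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the helper names (`masked_attention`, `perturb_mask`) are hypothetical, the scores are toy values, and the masked positions are picked at random, whereas the abstract describes meticulously crafted masks. It only shows how zeroing a small fraction of mask entries changes the attention distribution.

```python
import math
import random

def softmax(row):
    # Numerically stable softmax; entries set to -inf receive weight 0.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention(scores, mask):
    """Row-wise softmax over raw scores; mask[i][j] == 0 blocks key j for query i."""
    out = []
    for s_row, m_row in zip(scores, mask):
        blocked = [s if keep else -math.inf for s, keep in zip(s_row, m_row)]
        out.append(softmax(blocked))
    return out

def perturb_mask(mask, fraction, rng):
    """Zero out a fraction of off-diagonal mask entries.

    Assumption: positions are chosen at random for illustration only.
    The diagonal is kept so every row has at least one unmasked entry.
    """
    n = len(mask)
    candidates = [(i, j) for i in range(n) for j in range(n) if i != j]
    k = max(1, round(fraction * len(candidates)))
    new_mask = [row[:] for row in mask]
    for i, j in rng.sample(candidates, k):
        new_mask[i][j] = 0
    return new_mask

rng = random.Random(0)
n = 4
# Toy attention scores (hypothetical values, not taken from any model).
scores = [[float(i == j) + 0.1 * j for j in range(n)] for i in range(n)]
full_mask = [[1] * n for _ in range(n)]
attacked_mask = perturb_mask(full_mask, 0.1, rng)

clean_attn = masked_attention(scores, full_mask)
attacked_attn = masked_attention(scores, attacked_mask)
```

Each row of `attacked_attn` is still a valid probability distribution, but the blocked position contributes nothing and its weight is redistributed over the remaining keys. Under the same assumptions, applying random mask perturbations like this during training is one way to picture the structural smoothing the abstract attributes to S-Attend; the paper's actual procedure may differ.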

