LG | AI | CL | ML | May 7, 2024

Revisiting Character-level Adversarial Attacks for Language Models

arXiv:2405.04346v2 | 6 citations | h-index: 61 | Has Code | ICML
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of generating effective adversarial examples for NLP models while preserving semantics. The advance is incremental, as it builds on existing attack paradigms.

The paper tackles character-level adversarial attacks on language models, which have been overlooked due to perceived limitations. It introduces Charmer, which achieves a 4.84 percentage point improvement in attack success rate and an 8 percentage point improvement in similarity on BERT with SST-2 compared to prior methods.

Adversarial attacks in Natural Language Processing apply perturbations at the character or token level. Token-level attacks, which have gained prominence for their use of gradient-based methods, are susceptible to altering sentence semantics, leading to invalid adversarial examples. While character-level attacks easily maintain semantics, they have received less attention because they cannot easily adopt popular gradient-based methods and are thought to be easy to defend against. Challenging these beliefs, we introduce Charmer, an efficient query-based adversarial attack capable of achieving a high attack success rate (ASR) while generating highly similar adversarial examples. Our method successfully targets both small (BERT) and large (Llama 2) models. Specifically, on BERT with SST-2, Charmer improves the ASR by 4.84 percentage points and the USE similarity by 8 percentage points with respect to the prior art. Our implementation is available at https://github.com/LIONS-EPFL/Charmer.
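
To make the idea of a query-based character-level attack concrete, here is a minimal sketch. It is not the Charmer algorithm. It assumes a greedy search over single-character insertions, deletions, and substitutions under a fixed edit budget, and it uses a hypothetical keyword-counting classifier (`toy_scores`) in place of BERT or Llama 2. All function names and the `max_edits` parameter are illustrative.

```python
import string
from typing import Callable, List, Sequence, Tuple


def toy_scores(text: str) -> List[float]:
    """Hypothetical stand-in model: returns [negative, positive] keyword counts."""
    positive = {"good", "great", "love"}
    negative = {"bad", "awful", "hate"}
    words = [w.strip(string.punctuation) for w in text.lower().split()]
    return [float(sum(w in negative for w in words)),
            float(sum(w in positive for w in words))]


def predict(scores: Sequence[float]) -> int:
    """Index of the highest score (ties resolve to the lowest index)."""
    return max(range(len(scores)), key=lambda j: scores[j])


def margin(scores: Sequence[float], label: int) -> float:
    """True-class score minus the best other score; lower is better for the attacker."""
    return scores[label] - max(s for j, s in enumerate(scores) if j != label)


def single_char_edits(text: str, alphabet: str) -> List[str]:
    """All strings reachable with one insertion, deletion, or substitution."""
    out = []
    for i in range(len(text) + 1):
        for c in alphabet:
            out.append(text[:i] + c + text[i:])              # insertion
    for i in range(len(text)):
        out.append(text[:i] + text[i + 1:])                  # deletion
        for c in alphabet:
            if c != text[i]:
                out.append(text[:i] + c + text[i + 1:])      # substitution
    return out


def greedy_char_attack(
    text: str,
    label: int,
    model: Callable[[str], Sequence[float]],
    max_edits: int = 2,
    alphabet: str = string.ascii_lowercase + " ",
) -> Tuple[str, int, bool]:
    """Greedy attack that only queries the model (no gradients are used).

    Returns (perturbed text, number of edits applied, whether the label flipped).
    """
    current = text
    for step in range(1, max_edits + 1):
        candidates = single_char_edits(current, alphabet)
        # Keep the single edit that reduces the true-class margin the most.
        current = min(candidates, key=lambda t: margin(model(t), label))
        if predict(model(current)) != label:
            return current, step, True
    return current, max_edits, False
```

Calling `greedy_char_attack("a great movie", 1, toy_scores)` searches for a one- or two-character change that flips the toy model's prediction away from the positive class. Each step queries the model once per candidate, so the cost grows with the text length times the alphabet size. An efficient attack would need to prune this candidate set, which this sketch does not attempt.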

Code Implementations: 1 repo
