CL, Oct 31, 2022

Character-level White-Box Adversarial Attacks against Transformers via Attachable Subwords Substitution

Tsinghua
arXiv:2210.17004v1305 citations, h-index: 26
Originality: Incremental advance
AI Analysis

This work addresses a security vulnerability in transformer models. It offers a new attack method that incrementally improves adversarial example generation.

The authors tackled the problem of generating character-level white-box adversarial attacks against transformer models by substituting attachable subwords, achieving higher success rates and lower edit distances compared to previous methods.
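The edit-distance metric mentioned above can be illustrated with a short sketch. It assumes the standard character-level Levenshtein distance; the summary does not state the exact variant used in the paper, and the function name `edit_distance` is chosen here for illustration only.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]


# Hypothetical original/adversarial pair, not taken from the paper.
original = "a good movie"
adversarial = "a g00d movie"
distance = edit_distance(original, adversarial)
```

Under this metric, a lower value means the adversarial text stays closer to the original at the character level.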

We propose the first character-level white-box adversarial attack method against transformer models. The intuition behind our method comes from the observation that words are split into subtokens before being fed into transformer models, and that substituting one subtoken with a close one has an effect similar to a character modification. Our method contains three main steps. First, a gradient-based method is adopted to find the most vulnerable words in the sentence. Then we split the selected words into subtokens to replace the original tokenization result from the transformer tokenizer. Finally, we use an adversarial loss to guide the substitution of attachable subtokens, in which the Gumbel-softmax trick is introduced to ensure gradient propagation. Meanwhile, we introduce visual and length constraints in the optimization process to achieve minimal character modifications. Extensive experiments on both sentence-level and token-level tasks demonstrate that our method outperforms previous attack methods in terms of success rate and edit distance. Furthermore, human evaluation verifies that our adversarial examples preserve their original labels.
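The Gumbel-softmax trick named in the abstract can be sketched in a few lines. This is a minimal, standard-library illustration of the general technique and not the authors' implementation. The candidate subtokens, the logits, and the function name `gumbel_softmax` are hypothetical. In an autograd framework the same expression would be differentiable with respect to the logits, which is what lets a loss guide a discrete substitution choice.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Return relaxed one-hot weights over candidates.

    Adds Gumbel(0, 1) noise to each logit and applies a softmax
    with temperature tau; a smaller tau gives sharper weights.
    """
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    # Numerically stable softmax.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical attachable subtoken candidates and scores (assumed values).
candidates = ["##s", "##z", "##5"]
logits = [2.0, 1.0, 0.5]
weights = gumbel_softmax(logits, tau=0.5, rng=random.Random(0))
chosen = candidates[weights.index(max(weights))]
```

The weights form a probability distribution over the candidates, so a soft mixture of candidate embeddings can be fed to the model during optimization while the highest-weight candidate is used for the final discrete substitution.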

Code Implementations: 1 repo
