CL · AI · CR · CY · Jul 24, 2023

Gradient-Based Word Substitution for Obstinate Adversarial Examples Generation in Language Models

arXiv:2307.12507v2 · 4 citations · h-index: 8 · Has Code
Originality: Highly original
AI Analysis

This work addresses the challenge of evaluating model robustness in NLP by providing a more effective, automated way to generate adversarial examples, though it is incremental in that it builds on existing word substitution approaches.

The paper tackles the generation of obstinate adversarial examples in language models. It introduces GradObstinate, a gradient-based word substitution method that creates such examples automatically, without manual design. The method achieves a higher attack success rate than antonym-based methods, and its substitutions transfer to other models, including GPT-3 and ChatGPT.

In this paper, we study the problem of generating obstinate (over-stability) adversarial examples by word substitution in NLP, where the input text is meaningfully changed but the model's prediction does not change, even though it should. Previous word substitution approaches have predominantly relied on manually designed antonym-based strategies for generating obstinate adversarial examples, which limits their application: these strategies can find only a subset of obstinate adversarial examples and require human effort. To address this issue, we introduce a novel word substitution method named GradObstinate, a gradient-based approach that automatically generates obstinate adversarial examples without any constraints on the search space or the need for manual design principles. To empirically evaluate the efficacy of GradObstinate, we conduct comprehensive experiments on five representative models (ELECTRA, ALBERT, RoBERTa, DistilBERT, and CLIP) finetuned on four NLP benchmarks (SST-2, MRPC, SNLI, and SQuAD) and a language-grounding benchmark (MSCOCO). Extensive experiments show that GradObstinate generates more powerful obstinate adversarial examples, exhibiting a higher attack success rate than antonym-based methods. Furthermore, to show the transferability of obstinate word substitutions found by GradObstinate, we replace the words in four representative NLP benchmarks with their obstinate substitutions. Notably, obstinate substitutions exhibit a high success rate when transferred to other models in black-box settings, including even GPT-3 and ChatGPT. Obstinate adversarial examples found by GradObstinate are available at https://huggingface.co/spaces/anonauthors/SecretLanguage.
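The abstract does not spell out the GradObstinate procedure, so the following is only a minimal sketch of the general idea of gradient-guided word substitution that keeps a prediction fixed. Everything in it is an assumption made for illustration: the toy mean-pooled logistic classifier, the hand-picked embeddings `EMB` and weights `W`, and the function names. It ranks candidate replacement tokens by a first-order estimate of how the loss on the originally predicted label would change, picks the candidate with the lowest estimate, and then checks that the prediction is indeed unchanged. It does not model whether the substitution changes the meaning of the text, which is the other half of what makes an example obstinate.

```python
import math

# Hypothetical toy model: 5 tokens with 2-D embeddings and a linear classifier.
EMB = [[1.0, 0.0], [0.5, 0.5], [-1.0, 0.0], [2.0, 1.0], [0.0, -1.0]]
W = [1.0, 0.5]


def logit(tokens, emb, w):
    """Mean-pool the token embeddings and apply the linear classifier."""
    total = sum(emb[t][k] * w[k] for t in tokens for k in range(len(w)))
    return total / len(tokens)


def predict(tokens, emb, w):
    """Binary prediction: 1 if the logit is positive, else 0."""
    return 1 if logit(tokens, emb, w) > 0 else 0


def embedding_gradient(tokens, label, emb, w):
    """Gradient of the cross-entropy loss w.r.t. one token embedding.

    With mean pooling the gradient is identical for every position:
    dL/dz = sigmoid(z) - label and dz/de_i = w / n.
    """
    p = 1.0 / (1.0 + math.exp(-logit(tokens, emb, w)))
    coeff = (p - label) / len(tokens)
    return [coeff * wk for wk in w]


def substitute_keeping_prediction(tokens, position, emb, w):
    """Pick the replacement token with the lowest estimated loss change.

    The estimate is the first-order term (e_new - e_old) . grad, where the
    loss is measured against the model's own original prediction.
    Returns (new_token, new_tokens, prediction_preserved).
    """
    label = predict(tokens, emb, w)
    grad = embedding_gradient(tokens, label, emb, w)
    old_vec = emb[tokens[position]]

    best_token, best_score = None, float("inf")
    for cand, vec in enumerate(emb):
        if cand == tokens[position]:
            continue  # a substitution must change the token
        score = sum((vec[k] - old_vec[k]) * grad[k] for k in range(len(grad)))
        if score < best_score:
            best_token, best_score = cand, score

    new_tokens = list(tokens)
    new_tokens[position] = best_token
    # Verify with a real forward pass rather than trusting the estimate.
    preserved = predict(new_tokens, emb, w) == label
    return best_token, new_tokens, preserved


new_token, new_tokens, preserved = substitute_keeping_prediction([0, 1, 2], 1, EMB, W)
print(new_token, new_tokens, preserved)
```

In a real setting the gradient would come from backpropagation through a finetuned language model rather than from a closed-form expression, and the first-order estimate would only be approximate, which is why the final forward-pass check matters.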
