CL · LG · May 18

Universal Adversarial Triggers

Benedict Florance Arockiaraj, Alexander Feng, Jianxiong Cai, Xiaoyu Cheng

arXiv:2605.17936 · 62.3

Predicted impact: top 24% in CL (last 90 days) · Originality: incremental advance

AI Analysis

For NLP practitioners, this work improves the stealthiness of adversarial attacks and provides a defense method, though it is incremental over existing universal trigger approaches.

The paper proposes a method that combines parts-of-speech filtering with a perplexity-based loss to generate natural-looking universal adversarial triggers for NLP models. On the SST sentiment analysis dataset, the triggers reduce model accuracy to as low as 0.04 and 0.12 when flipping predictions, and adversarial training raises model accuracy from 0.12 to 0.48.

Recent work has shown that modern NLP models trained for diverse tasks, ranging from sentiment analysis to language generation, succumb to universal adversarial attacks, a class of input-agnostic attacks in which a single trigger sequence is used to attack the model on any input. Although these attacks are successful, the triggers they generate are ungrammatical and unnatural. Our work proposes a novel technique that combines parts-of-speech filtering with a perplexity-based loss function to generate sensible triggers that are closer to natural phrases. For sentiment analysis on the SST dataset, the method produces sensible triggers that reduce model accuracy to as low as 0.04 and 0.12 when flipping positive predictions to negative and vice versa. To build robust models, we also perform adversarial training with the generated triggers, which increases the accuracy of the model from 0.12 to 0.48. We aim to illustrate that adversarial attacks can be made difficult to detect by generating sensible triggers, and to facilitate robust model development through relevant defenses.
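The combination of parts-of-speech filtering and a perplexity penalty described above can be illustrated with a minimal sketch. This is not the authors' implementation: the vocabulary, POS tags, unigram probabilities, attack gains, the weight `lam`, and all function names are hypothetical toy values chosen only to show how a naturalness term can change which trigger token is selected. A real attack would take the attack gain from model gradients and the perplexity from a trained language model.

```python
import math

# All names and numbers below are hypothetical toy values, not taken from the paper.
POS_TAGS = {"the": "DET", "movie": "NOUN", "terrible": "ADJ",
            "awful": "ADJ", "boring": "ADJ", "zx##q": "X"}
UNIGRAM_PROB = {"the": 0.05, "movie": 0.01, "terrible": 0.002,
                "awful": 0.002, "boring": 0.003, "zx##q": 1e-9}
# Stand-in for how much each token raises the victim model's loss.
ATTACK_GAIN = {"the": 0.0, "movie": 0.1, "terrible": 2.5,
               "awful": 2.4, "boring": 1.0, "zx##q": 3.0}
ALLOWED_TAGS = {"ADJ", "NOUN", "DET"}


def pos_filter(candidates, allowed_tags=ALLOWED_TAGS):
    """Keep only candidates whose POS tag is in the allowed set."""
    return [tok for tok in candidates if POS_TAGS.get(tok) in allowed_tags]


def log_perplexity(tokens, floor=1e-12):
    """Log-perplexity of a token sequence under the toy unigram model."""
    total = sum(math.log(UNIGRAM_PROB.get(tok, floor)) for tok in tokens)
    return -total / len(tokens)


def score(prefix, candidate, lam):
    """Attack gain minus a weighted naturalness penalty (assumed form)."""
    return ATTACK_GAIN[candidate] - lam * log_perplexity(prefix + [candidate])


def select_token(prefix, candidates, lam=0.5, use_pos_filter=True):
    """Pick the next trigger token with the highest combined score."""
    pool = pos_filter(candidates) if use_pos_filter else list(candidates)
    return max(pool, key=lambda tok: score(prefix, tok, lam))


# Attack gain alone selects the gibberish token; the filtered, penalized
# objective selects a natural adjective instead.
print(select_token(["the"], POS_TAGS, lam=0.0, use_pos_filter=False))  # zx##q
print(select_token(["the"], POS_TAGS, lam=0.5))                        # terrible
```

In this toy setting the weight `lam` trades attack strength against fluency: at `lam=0.0` without filtering the highest-gain token wins regardless of how unnatural it is, while a positive `lam` together with the POS filter favors tokens that form a plausible phrase.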

