CL AI | Sep 5, 2024

Bypassing DARCY Defense: Indistinguishable Universal Adversarial Triggers

Zuquan Peng, Yuanyuan He, Jianbing Ni, Ben Niu

arXiv:2409.03183v1 | 1.0 | h-index: 8

Originality: Incremental advance

AI Analysis

This work addresses the vulnerability of NLP models and their defenses to adversarial attacks, showing that detection methods such as DARCY can be circumvented. The advance is incremental because it builds on existing UAT methods.

The paper tackles the problem of bypassing the DARCY defense against Universal Adversarial Triggers (UAT) in NLP models. It introduces IndisUAT, a method that generates triggers whose adversarial examples are indistinguishable from benign ones at DARCY's detection layer. IndisUAT reduces DARCY's true positive rate by at least 40.8% (RNN) and 90.6% (CNN), and drops model accuracy by at least 33.3% (RNN) and 51.6% (CNN).

Neural network (NN) classification models for Natural Language Processing (NLP) are vulnerable to the Universal Adversarial Triggers (UAT) attack that triggers a model to produce a specific prediction for any input. DARCY borrows the "honeypot" concept to bait multiple trapdoors, effectively detecting the adversarial examples generated by UAT. Unfortunately, we find a new UAT generation method, called IndisUAT, which produces triggers (i.e., tokens) and uses them to craft adversarial examples whose feature distribution is indistinguishable from that of the benign examples in a randomly chosen category at the detection layer of DARCY. The produced adversarial examples incur the maximal loss of predicting results in the DARCY-protected models. Meanwhile, the produced triggers are effective in black-box models for text generation, text inference, and reading comprehension. Finally, the evaluation results under NN models for NLP tasks indicate that the IndisUAT method can effectively circumvent DARCY and penetrate other defenses. For example, IndisUAT can reduce the true positive rate of DARCY's detection by at least 40.8% and 90.6%, and drop the accuracy by at least 33.3% and 51.6% in the RNN and CNN models, respectively. IndisUAT reduces the accuracy of BERT's adversarial defense model by at least 34.0%, and makes the GPT-2 language model spew racist outputs even when conditioned on non-racial context.
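The quantities named in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation: the function names, the toy keyword detector, and the trapdoor token "zzz" are all hypothetical stand-ins, and the mean-feature distance is only a crude proxy for the feature-distribution indistinguishability the abstract describes. The sketch shows how one trigger is prepended to every input, how a detector's true positive rate is measured, and how a gap between two sets of detection-layer features might be summarized.

```python
from statistics import fmean


def apply_trigger(trigger_tokens, input_tokens):
    """Prepend the same trigger tokens to an input (the 'universal' part of a UAT)."""
    return list(trigger_tokens) + list(input_tokens)


def true_positive_rate(detector, adversarial_examples):
    """Fraction of adversarial examples that the detector flags."""
    flagged = sum(1 for example in adversarial_examples if detector(example))
    return flagged / len(adversarial_examples)


def mean_feature_gap(features_a, features_b):
    """Euclidean distance between the mean feature vectors of two sets.

    Assumption: a crude proxy for 'indistinguishable feature distributions';
    a smaller gap means the two sets look more alike to this measure.
    """
    mean_a = [fmean(column) for column in zip(*features_a)]
    mean_b = [fmean(column) for column in zip(*features_b)]
    return sum((a - b) ** 2 for a, b in zip(mean_a, mean_b)) ** 0.5


# Hypothetical toy detector: flags any example containing a planted trapdoor token.
TRAPDOOR_TOKENS = {"zzz"}


def toy_detector(tokens):
    return any(token in TRAPDOOR_TOKENS for token in tokens)


benign_inputs = [["great", "film"], ["dull", "plot"]]

# A trigger that falls into the trapdoor versus one that avoids it.
trapped_trigger = ["zzz", "bad"]
evasive_trigger = ["quietly", "bad"]

trapped_examples = [apply_trigger(trapped_trigger, x) for x in benign_inputs]
evasive_examples = [apply_trigger(evasive_trigger, x) for x in benign_inputs]

tpr_trapped = true_positive_rate(toy_detector, trapped_examples)
tpr_evasive = true_positive_rate(toy_detector, evasive_examples)
```

In this toy setting the drop from tpr_trapped to tpr_evasive plays the role of the true positive rate reduction reported in the abstract. A real evaluation would use the DARCY detector and a trained model's detection-layer features in place of the stand-ins here.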
