CL CR LG

Jan 21, 2022

Identifying Adversarial Attacks on Text Classifiers

Zhouhang Xie, Jonathan Brophy, Adam Noack, Wencong You, Kalyani Asthana, Carter Perkins, Sabrina Reis, Sameer Singh, Daniel Lowd

arXiv:2201.08555v10.811 citations

Originality: Incremental advance

AI Analysis

This work addresses the need for forensic analysis of adversarial attacks on text classifiers, though it is incremental as it builds on existing attack detection methods.

The paper tackles the problem of identifying which adversarial attack was used to manipulate a text classifier. It builds a dataset of 1.5 million attack instances and trains classifiers to detect and label these attacks, using three classes of features: text properties, language model properties, and target model properties such as internal node activations.

The landscape of adversarial attacks against text classifiers continues to grow, with new attacks developed every year and many of them available in standard toolkits, such as TextAttack and OpenAttack. In response, there is a growing body of work on robust learning, which reduces vulnerability to these attacks, though sometimes at a high cost in compute time or accuracy. In this paper, we take an alternate approach -- we attempt to understand the attacker by analyzing adversarial text to determine which methods were used to create it. Our first contribution is an extensive dataset for attack detection and labeling: 1.5 million attack instances, generated by twelve adversarial attacks targeting three classifiers trained on six source datasets for sentiment analysis and abuse detection in English. As our second contribution, we use this dataset to develop and benchmark a number of classifiers for attack identification -- determining if a given text has been adversarially manipulated and by which attack. As a third contribution, we demonstrate the effectiveness of three classes of features for these tasks: text properties, capturing content and presentation of text; language model properties, determining which tokens are more or less probable throughout the input; and target model properties, representing how the text classifier is influenced by the attack, including internal node activations. Overall, this represents a first step towards forensics for adversarial attacks against text classifiers.
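To make the attack-labeling idea concrete, here is a minimal sketch of the first feature class only (text properties) feeding a toy nearest-centroid labeler. Everything in it is an assumption for illustration: the four surface features, the function names `text_properties`, `fit_centroids`, and `predict_attack`, the labels `clean`, `homoglyph`, and `leet`, and the tiny training set are hypothetical and are not the paper's features, attacks, models, or data.

```python
import string
from collections import defaultdict


def text_properties(text):
    """Hypothetical surface features (character-class ratios), not the paper's feature set."""
    n = max(len(text), 1)
    return [
        sum(c.isupper() for c in text) / n,              # uppercase ratio
        sum(c in string.punctuation for c in text) / n,  # punctuation ratio
        sum(ord(c) > 127 for c in text) / n,             # non-ASCII ratio
        sum(c.isdigit() for c in text) / n,              # digit ratio
    ]


def fit_centroids(texts, labels):
    """Average the feature vectors of each label (a stand-in for a real classifier)."""
    sums = {}
    counts = defaultdict(int)
    for text, label in zip(texts, labels):
        feats = text_properties(text)
        total = sums.setdefault(label, [0.0] * len(feats))
        sums[label] = [s + f for s, f in zip(total, feats)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict_attack(centroids, text):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    feats = text_properties(text)

    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(feats, centroid))

    return min(centroids, key=lambda label: sq_dist(centroids[label]))


# Toy training data with placeholder labels (assumed, not from the paper's dataset).
train_texts = [
    "the movie was great",
    "i really enjoyed this film",
    "the m\u043evie was gr\u0435at",        # Cyrillic look-alike letters
    "i really enj\u043eyed this film",
    "the m0vie was gr3at",                  # digit substitutions
    "i r3ally enj0yed this f1lm",
]
train_labels = ["clean", "clean", "homoglyph", "homoglyph", "leet", "leet"]

centroids = fit_centroids(train_texts, train_labels)
```

Calling `predict_attack(centroids, text)` on a new string returns one of the placeholder labels. A realistic system would add the other two feature classes described in the abstract (language model token probabilities and target model activations) and use a stronger classifier; this sketch only shows the shape of the detect-and-label task.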
