CL, LG · Dec 19, 2017

HotFlip: White-Box Adversarial Examples for Text Classification

Javid Ebrahimi, Anyi Rao, Daniel Lowd, Dejing Dou

arXiv:1712.06751v2 · 44.7 · 1553 citations

Originality: Highly original

AI Analysis

This work addresses the vulnerability of neural text classifiers to adversarial attacks and represents an incremental advance in adversarial machine learning for natural language processing.

The authors tackled the problem of generating white-box adversarial examples for text classification. They proposed an efficient method based on gradient-guided atomic flip operations that sharply decreased classifier accuracy with only a few manipulations. Its efficiency also made adversarial training practical, which improved model robustness.
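The gradient-guided flip can be written as a first-order estimate of how much a single flip raises the loss. The notation here is ours and is a sketch, not a reproduction of the paper's equations: $x_{i,c}$ is the one-hot entry for symbol $c$ at position $i$, $a_i$ is the symbol currently at position $i$, and $J$ is the classifier's loss on the true label. Flipping $a_i$ to $b$ adds a vector $v$ with $-1$ at $(i, a_i)$ and $+1$ at $(i, b)$, so

$$
J(x + v) \approx J(x) + \nabla_x J(x)^{\top} v = J(x) + \frac{\partial J}{\partial x_{i,b}} - \frac{\partial J}{\partial x_{i,a_i}},
$$

and the flip with the largest estimated increase is chosen:

$$
(i^{*}, b^{*}) = \arg\max_{i,\; b \neq a_i} \left( \frac{\partial J}{\partial x_{i,b}} - \frac{\partial J}{\partial x_{i,a_i}} \right).
$$

Since one backward pass yields $\partial J / \partial x$ for every position and symbol at once, all candidate flips are scored together, which is presumably what keeps the method cheap enough to use inside adversarial training.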

We propose an efficient method to generate white-box adversarial examples to trick a character-level neural classifier. We find that only a few manipulations are needed to greatly decrease the accuracy. Our method relies on an atomic flip operation, which swaps one token for another, based on the gradients of the one-hot input vectors. Due to the efficiency of our method, we can perform adversarial training, which makes the model more robust to attacks at test time. With the use of a few semantics-preserving constraints, we demonstrate that HotFlip can be adapted to attack a word-level classifier as well.
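A single gradient-based flip step can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the authors' code: the model is a hypothetical linear logistic classifier over per-position one-hot symbols (chosen so its gradient has a closed form), and the names `loss_and_grad`, `best_flip`, and the optional `allowed` predicate are ours. `allowed` stands in for the kind of semantics-preserving constraint the abstract mentions for the word-level attack; the paper's actual constraints are not reproduced here.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Matrix = List[List[float]]


def loss_and_grad(w: Matrix, tokens: Sequence[int], y: int) -> Tuple[float, Matrix]:
    """Hypothetical toy model: logit z = sum_i w[i][tokens[i]], p = sigmoid(z).

    Returns the binary cross-entropy loss J for label y and dJ/dx[i][c]
    for every position i and symbol c of the one-hot input x.
    """
    z = sum(w[i][a] for i, a in enumerate(tokens))
    p = 1.0 / (1.0 + math.exp(-z))
    loss = -math.log(p) if y == 1 else -math.log(1.0 - p)
    dz = p - y  # dJ/dz for sigmoid followed by cross-entropy
    # z is linear in the one-hot input, so dz/dx[i][c] = w[i][c].
    grad = [[dz * w_ic for w_ic in row] for row in w]
    return loss, grad


def best_flip(
    grad: Matrix,
    tokens: Sequence[int],
    allowed: Optional[Callable[[int, int, int], bool]] = None,
) -> Tuple[int, int, float]:
    """Pick the single flip (position i, a -> b) maximizing grad[i][b] - grad[i][a].

    `allowed(i, a, b)` is a hypothetical constraint hook; None permits any flip.
    Returns (position, new_symbol, estimated_loss_increase).
    """
    best = (-1, -1, float("-inf"))
    for i, a in enumerate(tokens):
        for b, g in enumerate(grad[i]):
            if b == a or (allowed is not None and not allowed(i, a, b)):
                continue  # skip no-op flips and flips the constraint rejects
            gain = g - grad[i][a]  # first-order change in J from this flip
            if gain > best[2]:
                best = (i, b, gain)
    return best


if __name__ == "__main__":
    w = [[1.0, -2.0, 0.5], [0.3, 0.0, -1.0]]  # 2 positions, 3 symbols
    tokens, y = [0, 0], 1
    loss, grad = loss_and_grad(w, tokens, y)
    i, b, gain = best_flip(grad, tokens)
    flipped = list(tokens)
    flipped[i] = b
    new_loss, _ = loss_and_grad(w, flipped, y)
    print(f"flip position {i} to symbol {b}: loss {loss:.3f} -> {new_loss:.3f}")
```

Running the script prints which flip the first-order rule selects and the loss before and after applying it. For a real network, `grad` would instead come from automatic differentiation of the loss with respect to the one-hot input, and a full attack would repeat (or search over) such flips; this one-step sketch does neither.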

