CL, LG · Dec 19, 2017

HotFlip: White-Box Adversarial Examples for Text Classification

arXiv:1712.06751v2 · 1553 citations
AI Analysis

This work addresses the vulnerability of neural text classifiers to adversarial attacks. It is an incremental advance in adversarial machine learning for natural language processing.

The authors generate white-box adversarial examples for text classification with an efficient gradient-based method built on atomic flip operations. Only a few such manipulations greatly decrease classifier accuracy, and the method is efficient enough to support adversarial training, which improves model robustness.

We propose an efficient method to generate white-box adversarial examples to trick a character-level neural classifier. We find that only a few manipulations are needed to greatly decrease the accuracy. Our method relies on an atomic flip operation, which swaps one token for another, based on the gradients of the one-hot input vectors. Due to efficiency of our method, we can perform adversarial training which makes the model more robust to attacks at test time. With the use of a few semantics-preserving constraints, we demonstrate that HotFlip can be adapted to attack a word-level classifier as well.
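The flip operation described in the abstract can be illustrated with a minimal sketch. It assumes a toy linear classifier over a flat sequence of one-hot character vectors, not the paper's character-level neural model, and it performs a single greedy flip with no beam search or semantics-preserving constraints. The names `loss_and_grad`, `best_flip`, and `apply_flip` are hypothetical and chosen only for illustration. The idea shown is the first-order estimate: replacing the character `a` at position `i` with `b` changes the loss by roughly `grad[i, b] - grad[i, a]`, so the attack picks the pair `(i, b)` with the largest estimated increase.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def loss_and_grad(x_onehot, W, label):
    """Cross-entropy loss of a toy linear classifier (an assumption, not the
    paper's model) and its gradient with respect to the one-hot input.

    x_onehot: (n, V) one-hot rows, one per character position
    W:        (n, V, C) position-specific weights for C classes
    """
    logits = np.einsum("nv,nvc->c", x_onehot, W)
    probs = softmax(logits)
    loss = -np.log(probs[label])
    dlogits = probs.copy()
    dlogits[label] -= 1.0          # d(loss)/d(logits) for cross-entropy
    grad = W @ dlogits             # (n, V, C) @ (C,) -> (n, V)
    return loss, grad


def best_flip(x_onehot, grad):
    """Pick the single flip with the largest first-order loss increase."""
    n = x_onehot.shape[0]
    rows = np.arange(n)
    current = x_onehot.argmax(axis=1)
    # gain[i, b] = grad[i, b] - grad[i, current[i]]
    gain = grad - grad[rows, current][:, None]
    gain[rows, current] = -np.inf  # flipping a character to itself is a no-op
    i, b = np.unravel_index(np.argmax(gain), gain.shape)
    return int(i), int(b), float(gain[i, b])


def apply_flip(x_onehot, i, b):
    """Return a copy of the input with position i set to character b."""
    flipped = x_onehot.copy()
    flipped[i] = 0.0
    flipped[i, b] = 1.0
    return flipped
```

For this toy model the loss is convex in the input, so the true loss increase after the flip is at least the first-order estimate. For a real neural classifier the estimate is only an approximation, and the gradient would come from backpropagation rather than a closed form.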

Code Implementations: 2 repos
