On Adversarial Examples for Character-Level Neural Machine Translation
This work addresses robustness evaluation for NLP models, specifically machine translation. The contribution is incremental, as it builds on existing adversarial example research.
The paper evaluates the robustness of character-level neural machine translation by introducing a novel white-box adversarial attack based on differentiable string-edit operations. The attack is shown to be significantly stronger than black-box methods and reveals more serious vulnerabilities than previously known. Adversarial training improves robustness significantly, although it takes 3 times longer than regular training.
Evaluating on adversarial examples has become a standard procedure to measure robustness of deep learning models. Due to the difficulty of creating white-box adversarial examples for discrete text input, most analyses of the robustness of NLP models have been done through black-box adversarial examples. We investigate adversarial examples for character-level neural machine translation (NMT), and contrast black-box adversaries with a novel white-box adversary, which employs differentiable string-edit operations to rank adversarial changes. We propose two novel types of attacks which aim to remove or change a word in a translation, rather than simply break the NMT system. We demonstrate that white-box adversarial examples are significantly stronger than their black-box counterparts in different attack scenarios, revealing more serious vulnerabilities than previously known. In addition, after performing adversarial training, which takes only 3 times longer than regular training, we can improve the model's robustness significantly.
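To make the idea of ranking adversarial changes with a differentiable edit operation more concrete, the following is a minimal sketch of one way such a ranking could work for single-character substitutions. It assumes (this is not stated in the abstract) that the attacker uses a first-order estimate: if the input is one-hot encoded and the gradient of the loss with respect to that encoding is available, then replacing character a with character b at position i changes the loss by roughly grad[i][b] - grad[i][a]. The function names `rank_char_flips` and `apply_flip`, the toy alphabet, and the gradient values are all hypothetical; in practice the gradient would come from backpropagation through the NMT model.

```python
from typing import List, Tuple


def rank_char_flips(
    text: str, alphabet: str, grad: List[List[float]]
) -> List[Tuple[float, int, str]]:
    """Rank single-character substitutions by estimated loss increase.

    grad[i][j] is ASSUMED to hold dLoss/dx[i][j], where x is the one-hot
    encoding of `text` over `alphabet` (position i, character index j).
    Returns (estimated_gain, position, new_char) tuples, best first.
    """
    index = {ch: j for j, ch in enumerate(alphabet)}
    candidates = []
    for i, old in enumerate(text):
        a = index[old]
        for b, new in enumerate(alphabet):
            if b == a:
                continue  # replacing a character with itself is not an edit
            # A flip from a to b changes the one-hot vector by (e_b - e_a),
            # so the first-order change in loss is grad[i][b] - grad[i][a].
            gain = grad[i][b] - grad[i][a]
            candidates.append((gain, i, new))
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates


def apply_flip(text: str, pos: int, new_char: str) -> str:
    """Return `text` with the character at `pos` replaced by `new_char`."""
    return text[:pos] + new_char + text[pos + 1:]


# Toy example with a made-up gradient (rows: positions, columns: "abct").
alphabet = "abct"
text = "cat"
grad = [
    [0.1, 0.0, -0.2, 0.0],
    [0.5, 0.9, 0.0, 0.1],
    [0.0, 0.0, 0.0, 0.3],
]
ranked = rank_char_flips(text, alphabet, grad)
_, best_pos, best_char = ranked[0]
adversarial = apply_flip(text, best_pos, best_char)
```

A single backward pass yields the scores for every candidate substitution at once, which is what makes a gradient-based ranking cheaper than querying the model once per candidate as a black-box adversary would. Insertions and deletions could plausibly be scored in a similar way by treating them as sequences of substitutions, though that extension is not shown here.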