From Hero to Zéroe: A Benchmark of Low-Level Adversarial Attacks
The work addresses the need for more realistic robustness testing of NLP models in applications such as social media, though its contribution is incremental in that it mainly shifts the focus from high-level to low-level attacks.
The paper addresses adversarial attacks in NLP by focusing on low-level, character-level attacks, which the authors argue are more realistic than high-level paraphrasing. It introduces Zéroe, a large-scale benchmark with nine attack modes, and shows that RoBERTa fails on these attacks.
Adversarial attacks are label-preserving modifications to inputs of machine learning classifiers designed to fool machines but not humans. Natural Language Processing (NLP) has mostly focused on high-level attack scenarios such as paraphrasing input texts. We argue that these are less realistic in typical application scenarios such as in social media, and instead focus on low-level attacks at the character level. Guided by human cognitive abilities and human robustness, we propose the first large-scale catalogue and benchmark of low-level adversarial attacks, which we dub Zéroe, encompassing nine different attack modes including visual and phonetic adversaries. We show that RoBERTa, NLP's current workhorse, fails on our attacks. Our dataset provides a benchmark for testing robustness of future, more human-like NLP models.
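
To make the idea of a low-level, character-level attack concrete, the following is a minimal sketch of two perturbations of this general kind: swapping adjacent inner characters and replacing letters with visually similar ones. It is not the Zéroe implementation. The function names, the `p` parameter, and the small look-alike table are assumptions made for illustration, and the abstract does not specify how the nine attack modes are actually defined.

```python
import random

# Hypothetical look-alike table (Latin -> Cyrillic); illustrative only,
# not the mapping used by the benchmark.
HOMOGLYPHS = {
    "a": "\u0430",
    "c": "\u0441",
    "e": "\u0435",
    "i": "\u0456",
    "o": "\u043e",
}


def swap_inner(word, rng):
    """Swap two adjacent inner characters, keeping first and last fixed."""
    if len(word) < 4:
        return word  # too short to have two inner characters
    i = rng.randrange(1, len(word) - 2)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def visual_substitute(word, rng, p):
    """Replace each character that has a look-alike with probability p."""
    return "".join(
        HOMOGLYPHS[c] if c in HOMOGLYPHS and rng.random() < p else c
        for c in word
    )


def perturb(text, mode, p=0.5, seed=0):
    """Apply one perturbation mode to a whitespace-tokenized text."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    out = []
    for word in text.split():
        if mode == "swap":
            out.append(swap_inner(word, rng) if rng.random() < p else word)
        elif mode == "visual":
            out.append(visual_substitute(word, rng, p))
        else:
            raise ValueError(f"unknown mode: {mode}")
    return " ".join(out)
```

Calling `perturb("the service was excellent", "swap", p=1.0)` scrambles the inside of every word of four or more characters while leaving its length and its first and last letters intact. A human reader can usually still recover the sentence, while a classifier's tokenizer sees different subword units, which is the gap this kind of benchmark is meant to probe.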