Learning to Ignore Adversarial Attacks
This paper addresses adversarial vulnerability in NLP models, a problem that matters to practitioners, and offers a novel method rather than an incremental one.
The paper tackles the brittleness of NLP models against adversarial attacks by introducing rationale models that learn to ignore attack tokens. These models ignore over 90% of attack tokens and yield consistent robustness improvements of ~10% over baselines on three datasets for both BERT and RoBERTa.
Despite the strong performance of current NLP models, they can be brittle against adversarial attacks. To enable effective learning against adversarial inputs, we introduce the use of rationale models that can explicitly learn to ignore attack tokens. We find that the rationale models can successfully ignore over 90% of attack tokens. This approach leads to consistent sizable improvements ($\sim$10%) over baseline models in robustness on three datasets for both BERT and RoBERTa, and also reliably outperforms data augmentation with adversarial examples alone. In many cases, we find that our method is able to close the gap between model performance on a clean test set and an attacked test set and hence reduce the effect of adversarial attacks.
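To make the idea of "ignoring attack tokens" concrete, the sketch below shows one way a binary rationale could be applied to an input and how an ignore rate over attack tokens could be computed. It is an illustration under assumptions, not the method from the paper: the function names `apply_rationale_mask` and `attack_ignore_rate`, the `[MASK]` placeholder, the toy tokens, and the definition of ignore rate as the fraction of attack tokens excluded from the rationale are all hypothetical choices made here.

```python
from typing import List, Sequence


def apply_rationale_mask(
    tokens: Sequence[str],
    keep_mask: Sequence[int],
    mask_token: str = "[MASK]",  # assumed placeholder, not taken from the paper
) -> List[str]:
    """Replace every token the rationale drops (mask value 0) with a placeholder."""
    if len(tokens) != len(keep_mask):
        raise ValueError("tokens and keep_mask must have the same length")
    return [tok if keep else mask_token for tok, keep in zip(tokens, keep_mask)]


def attack_ignore_rate(keep_mask: Sequence[int], is_attack: Sequence[int]) -> float:
    """Fraction of attack tokens that the rationale excludes (assumed definition)."""
    if len(keep_mask) != len(is_attack):
        raise ValueError("keep_mask and is_attack must have the same length")
    attack_positions = [i for i, flag in enumerate(is_attack) if flag]
    if not attack_positions:
        return 0.0  # no attack tokens present, so nothing to ignore
    ignored = sum(1 for i in attack_positions if not keep_mask[i])
    return ignored / len(attack_positions)


# Toy example: the last two tokens stand in for inserted attack tokens.
tokens = ["the", "movie", "was", "great", "zxq", "blorp"]
is_attack = [0, 0, 0, 0, 1, 1]
keep_mask = [1, 1, 1, 1, 0, 1]  # hypothetical rationale: drops one of two attack tokens

masked_tokens = apply_rationale_mask(tokens, keep_mask)
rate = attack_ignore_rate(keep_mask, is_attack)
```

In this toy case the rationale drops one of the two attack tokens, so `rate` is 0.5; a rate above 0.9, as reported in the abstract, would mean that more than nine in ten attack tokens never reach the downstream predictor.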