Balanced Adversarial Training: Balancing Tradeoffs between Fickleness and Obstinacy in NLP Models
This work addresses a critical tradeoff in adversarial robustness for NLP models and offers a more balanced defense against different attack types, though the contribution is incremental because it builds on existing adversarial training methods.
The paper tackles the problem that standard adversarial training for NLP models reduces vulnerability to fickle adversarial examples but increases vulnerability to obstinate ones. It introduces Balanced Adversarial Training, which uses contrastive learning to improve robustness against both types, and reports gains of up to 15% in robustness metrics on natural language inference and paraphrase identification tasks.
Traditional (fickle) adversarial examples involve finding a small perturbation that does not change an input's true label but confuses the classifier into outputting a different prediction. Conversely, obstinate adversarial examples occur when an adversary finds a small perturbation that preserves the classifier's prediction but changes the true label of an input. Adversarial training and certified robust training have shown some effectiveness in improving the robustness of machine learnt models to fickle adversarial examples. We show that standard adversarial training methods focused on reducing vulnerability to fickle adversarial examples may make a model more vulnerable to obstinate adversarial examples, with experiments for both natural language inference and paraphrase identification tasks. To counter this phenomenon, we introduce Balanced Adversarial Training, which incorporates contrastive learning to increase robustness against both fickle and obstinate adversarial examples.
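The two definitions above, and the general idea of a contrastive objective that treats them differently, can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the function names `classify_perturbation` and `triplet_loss`, the label strings, the Euclidean distance, and the margin value are hypothetical and are not taken from the paper. The triplet margin loss is one common form of contrastive objective; the abstract does not specify which formulation Balanced Adversarial Training uses.

```python
import math
from typing import Sequence


def classify_perturbation(true_orig: str, true_pert: str,
                          pred_orig: str, pred_pert: str) -> str:
    """Categorize a perturbed input relative to its original (hypothetical helper).

    fickle:    the true label is unchanged but the prediction changes
    obstinate: the true label changes but the prediction stays the same
    """
    label_changed = true_orig != true_pert
    pred_changed = pred_orig != pred_pert
    if not label_changed and pred_changed:
        return "fickle"
    if label_changed and not pred_changed:
        return "obstinate"
    return "neither"


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor: Sequence[float],
                 positive: Sequence[float],
                 negative: Sequence[float],
                 margin: float = 1.0) -> float:
    """Triplet margin loss (an assumed stand-in for the contrastive term).

    positive: embedding of a label-preserving perturbation (should stay close)
    negative: embedding of a label-changing perturbation (should move away)
    The loss is zero once the negative is farther than the positive by `margin`.
    """
    return max(0.0, euclidean(anchor, positive)
               - euclidean(anchor, negative) + margin)


if __name__ == "__main__":
    # Label unchanged, prediction flipped: a fickle example.
    print(classify_perturbation("entailment", "entailment",
                                "entailment", "neutral"))
    # Label flipped, prediction unchanged: an obstinate example.
    print(classify_perturbation("entailment", "contradiction",
                                "entailment", "entailment"))
    # A label-changing perturbation that sits on top of the original
    # in embedding space is penalized.
    print(triplet_loss([0.0, 0.0], [0.1, 0.0], [0.0, 0.0]))
```

In this sketch, minimizing the loss pulls label-preserving perturbations toward the original representation, which targets fickleness, and pushes label-changing perturbations away from it, which targets obstinacy. That is the balance the abstract describes, under the assumptions stated above.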