CL · AI · Nov 10, 2022

Impact of Adversarial Training on Robustness and Generalizability of Language Models

arXiv:2211.05523v3226 citations · h-index: 38
Originality: Incremental advance
AI Analysis

This work addresses the robustness-generalization trade-off in language models, offering insights for NLP practitioners, but it is incremental as it builds on existing adversarial training techniques.

The study compared adversarial training methods for language models, finding that pre-training data augmentation and input space perturbations improve robustness, while embedding space perturbations enhance generalization due to more specialized neurons.

Adversarial training is widely acknowledged as the most effective defense against adversarial attacks. However, it is also well established that achieving both robustness and generalization in adversarially trained models involves a trade-off. The goal of this work is to provide an in-depth comparison of different approaches for adversarial training in language models. Specifically, we study the effect of pre-training data augmentation as well as training-time input perturbations vs. embedding space perturbations on the robustness and generalization of transformer-based language models. Our findings suggest that better robustness can be achieved by pre-training data augmentation or by training with input space perturbation. However, training with embedding space perturbation significantly improves generalization. A linguistic correlation analysis of neurons of the learned models reveals that the improved generalization is due to 'more specialized' neurons. To the best of our knowledge, this is the first work to carry out a deep qualitative analysis of different methods of generating adversarial examples in adversarial training of language models.
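The contrast between input space and embedding space perturbations can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a toy linear classifier with a logistic loss in place of a transformer, and every name in it (`logistic_loss`, `perturb_embedding`, `perturb_input`, the substitution table) is hypothetical. The embedding space perturbation is a single L2-normalized gradient-ascent step on the loss; the input space perturbation is a plain token substitution.

```python
import math


def logistic_loss(w, e, y):
    """Logistic loss of a toy linear classifier on embedding e, label y in {-1, +1}."""
    margin = y * sum(wi * ei for wi, ei in zip(w, e))
    return math.log1p(math.exp(-margin))


def loss_grad_wrt_embedding(w, e, y):
    """Gradient of the logistic loss with respect to the embedding e."""
    margin = y * sum(wi * ei for wi, ei in zip(w, e))
    # d/dm log(1 + exp(-m)) = -1 / (1 + exp(m)); chain rule adds the factor y * w.
    coeff = -y / (1.0 + math.exp(margin))
    return [coeff * wi for wi in w]


def perturb_embedding(w, e, y, epsilon):
    """Embedding space perturbation: one gradient-ascent step of L2 norm epsilon."""
    grad = loss_grad_wrt_embedding(w, e, y)
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(e)  # no direction to move in; leave the embedding unchanged
    return [ei + epsilon * g / norm for ei, g in zip(e, grad)]


def perturb_input(tokens, substitutions):
    """Input space perturbation: replace tokens using a (hypothetical) substitution table."""
    return [substitutions.get(tok, tok) for tok in tokens]


# Toy values, chosen only for illustration.
w = [0.5, -1.0, 2.0]   # classifier weights
e = [1.0, 0.0, 0.5]    # embedding of one example
y = 1                  # gold label

e_adv = perturb_embedding(w, e, y, epsilon=0.1)
tokens_adv = perturb_input(["the", "movie", "was", "good"], {"good": "decent"})
```

In adversarial training, the perturbed example (`e_adv` or `tokens_adv`) would be fed back into the training loss alongside or instead of the clean example. The embedding space variant needs gradients but no discrete search, whereas the input space variant yields readable text but depends on the quality of the substitution source.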
