Not All Attention Is All You Need
This work addresses overfitting in large self-attention models, which matters to NLP practitioners, but the contribution is incremental because it builds on existing dropout techniques.
The paper tackles overfitting in pre-trained language models (PrLMs) by proposing AttendOut, a novel dropout method designed for self-attention architectures. The authors report stronger results from more robust task-specific tuning across a wide range of NLP tasks.
Despite their success in recent natural language processing, pre-trained language models (PrLMs) are susceptible to over-fitting due to their unusually large model size. To this end, dropout serves as a remedy. However, existing methods such as random-based, knowledge-based and search-based dropout are more general but less effective on self-attention based models, which are broadly chosen as the fundamental architecture of PrLMs. In this paper, we propose a novel dropout method named AttendOut to make self-attention empowered PrLMs capable of more robust task-specific tuning. We demonstrate that state-of-the-art models with elaborate training design may achieve much stronger results. We verify the universality of our approach on extensive natural language processing tasks.
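For orientation, the "random-based" dropout that the abstract contrasts against can be sketched as plain random masking of self-attention weights. This is a minimal illustration of that generic baseline only, not AttendOut, whose mechanism is not described in this summary. The function names, the `p_drop` parameter, and the use of NumPy are assumptions made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_with_dropout(q, k, v, p_drop=0.1, rng=None, training=True):
    """Scaled dot-product attention with random dropout on the weights.

    Hypothetical helper illustrating the generic random-based baseline;
    it is NOT the AttendOut method from the paper.
    """
    if not 0.0 <= p_drop < 1.0:
        raise ValueError("p_drop must be in [0, 1)")
    d_k = q.shape[-1]
    # Pairwise similarity scores, scaled by sqrt of the key dimension.
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    if training and p_drop > 0.0:
        rng = np.random.default_rng() if rng is None else rng
        # Keep each attention weight independently with probability 1 - p_drop.
        keep = rng.random(weights.shape) >= p_drop
        # Inverted dropout: rescale survivors so the expected value is unchanged.
        weights = weights * keep / (1.0 - p_drop)
    return weights @ v, weights
```

With `training=False` the function reduces to ordinary attention and each row of weights sums to one. During training, the masked rows generally no longer sum to one, which is the usual behaviour of inverted dropout applied after the softmax.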