CL · Apr 10, 2021

Not All Attention Is All You Need

arXiv:2104.04692v3 · 11 citations
AI Analysis

This paper addresses overfitting in large self-attention models, a practical concern for NLP practitioners, but the contribution is incremental because it builds on existing dropout techniques.

The paper tackles overfitting in pre-trained language models (PrLMs) by proposing AttendOut, a novel dropout method for self-attention architectures. The authors report more robust task-specific tuning and stronger results across a wide range of NLP tasks.

Despite the success of pre-trained language models (PrLMs) in recent natural language processing, they are susceptible to overfitting because of their unusually large model size. Dropout serves as a remedy. However, existing random-based, knowledge-based and search-based dropout methods are general-purpose and less effective on self-attention based models, which are broadly chosen as the fundamental architecture of PrLMs. In this paper, we propose a novel dropout method named AttendOut that makes self-attention empowered PrLMs capable of more robust task-specific tuning. We demonstrate that state-of-the-art models with elaborate training design may achieve much stronger results. We verify the universality of our approach on extensive natural language processing tasks.
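
For readers unfamiliar with the baseline the abstract contrasts against, the following is a minimal sketch of plain random-based dropout applied to self-attention weights. It is not AttendOut, whose mechanism the abstract does not describe. The function names (`attention_weights`, `random_attention_dropout`), the toy inputs, and the drop rate of 0.1 are all illustrative assumptions.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    """Scaled dot-product attention weights, one row per query."""
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys])
        for q in queries
    ]


def random_attention_dropout(weights, p, rng):
    """Baseline random dropout on attention weights (NOT AttendOut).

    Each weight is zeroed with probability p; survivors are rescaled by
    1 / (1 - p) (inverted dropout) so the expected value is unchanged.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    keep = 1.0 - p
    return [[w / keep if rng.random() >= p else 0.0 for w in row] for row in weights]


# Toy example: 2 queries attending over 3 keys (hypothetical values).
rng = random.Random(0)
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = attention_weights(q, k)
dropped = random_attention_dropout(w, p=0.1, rng=rng)
```

Because the mask here is drawn independently of the input and of the task, this corresponds to what the abstract calls random-based dropout. Dropout of this kind is applied only during training and disabled at inference.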

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
