LG | Feb 25, 2021

SparseBERT: Rethinking the Importance Analysis in Self-attention

arXiv:2102.12871v3 | 64 citations
AI Analysis

This work addresses the efficiency and interpretability of Transformer models in NLP, offering incremental improvements by optimizing attention sparsity based on empirical analysis.

The paper studies self-attention in Transformers by analyzing the importance of different positions in the attention matrix during pre-training, rather than in an already pre-trained model. It finds that diagonal elements are the least important and provides a proof that they can be removed without degrading model performance. It also proposes a Differentiable Attention Mask (DAM) algorithm to guide the design of SparseBERT, and reports experiments that support these findings.

Transformer-based models are widely used in natural language processing (NLP). Their core component, self-attention, has aroused widespread interest. A direct way to understand the self-attention mechanism is to visualize the attention map of a pre-trained model. Based on the patterns observed, a series of efficient Transformers with different sparse attention masks have been proposed. From a theoretical perspective, the universal approximability of Transformer-based models has also recently been proved. However, the above understanding and analysis of self-attention is based on a pre-trained model. To rethink the importance analysis in self-attention, we study the significance of different positions in the attention matrix during pre-training. A surprising result is that diagonal elements in the attention map are the least important compared with other attention positions. We provide a proof showing that these diagonal elements can indeed be removed without deteriorating model performance. Furthermore, we propose a Differentiable Attention Mask (DAM) algorithm, which further guides the design of SparseBERT. Extensive experiments verify our interesting findings and illustrate the effect of the proposed algorithm.
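To make the diagonal finding concrete, the following is a minimal sketch of scaled dot-product attention in which position (i, i) of the attention map is masked out, so no token attends to itself. It is an illustration only, not the authors' implementation and not the DAM algorithm; the function names `softmax` and `attention_without_diagonal`, the single-head setup, and the toy inputs are all assumptions made for this example.

```python
import math

def softmax(row):
    """Numerically stable softmax; entries equal to -inf receive weight 0."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_without_diagonal(queries, keys, values):
    """Single-head scaled dot-product attention with the diagonal masked.

    Hypothetical helper for illustration. Assumes at least two tokens,
    otherwise every score in a row would be masked.
    """
    n, d = len(queries), len(queries[0])
    outputs, weights = [], []
    for i in range(n):
        scores = []
        for j in range(n):
            if i == j:
                # Mask the diagonal position (i, i) of the attention map.
                scores.append(float("-inf"))
            else:
                dot = sum(q * k for q, k in zip(queries[i], keys[j]))
                scores.append(dot / math.sqrt(d))
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of value vectors, one output column at a time.
        outputs.append([
            sum(w[j] * values[j][c] for j in range(n))
            for c in range(len(values[0]))
        ])
    return outputs, weights

# Toy input: three tokens with two-dimensional embeddings (made-up values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = attention_without_diagonal(x, x, x)
```

Each row of `attn` still sums to 1, but its diagonal entry is exactly 0, so every output is built only from the other tokens' values.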
