CL, Sep 15, 2021

Incorporating Residual and Normalization Layers into Analysis of Masked Language Models

arXiv:2109.07152v1671 citations
Originality: Synthesis-oriented
AI Analysis

The paper offers new intuitive explanations for interpreting Transformer-based masked language models, though its scope is incremental.

The study extends Transformer analysis beyond attention patterns to the residual connection and layer normalization. It finds that token-to-token interaction via attention has less impact on intermediate representations than previously assumed, which helps explain why discarding learned attention patterns tends not to harm performance.

The Transformer architecture has become ubiquitous in natural language processing. To interpret Transformer-based models, their attention patterns have been extensively analyzed. However, the Transformer architecture is not composed of multi-head attention alone; other components can also contribute to Transformers' progressive performance. In this study, we extend the scope of the analysis of Transformers from solely the attention patterns to the whole attention block, i.e., multi-head attention, residual connection, and layer normalization. Our analysis of Transformer-based masked language models shows that the token-to-token interaction performed via attention has less impact on the intermediate representations than previously assumed. These results provide new intuitive explanations of existing reports; for example, discarding the learned attention patterns tends not to adversely affect the performance. The code for our experiments is publicly available.
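To make the notion of the "whole attention block" concrete, the following is a minimal sketch, not the authors' released code. It assumes a single attention head, random untrained weights, no bias terms, no output projection, and a layer normalization without learned gain or bias. The function names and the `mixing ratio` computed here are illustrative choices. For each token, the pre-normalization sum is split into a part contributed by other tokens (via attention) and a part contributed by the token itself (its own attention term plus the residual connection), and the norms of the two parts are compared.

```python
import math
import random


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(v, eps=1e-5):
    """Layer normalization without learned gain/bias (a simplification)."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def attention_block(X, Wq, Wk, Wv):
    """Single-head attention + residual + layer norm (post-LN ordering).

    Returns the block outputs and, per token, an illustrative ratio
    ||mixed|| / (||mixed|| + ||preserved||).
    """
    n, d = len(X), len(X[0])
    Q = [matvec(Wq, x) for x in X]
    K = [matvec(Wk, x) for x in X]
    V = [matvec(Wv, x) for x in X]
    outputs, ratios = [], []
    for i in range(n):
        scores = [sum(a * b for a, b in zip(Q[i], K[j])) / math.sqrt(d)
                  for j in range(n)]
        alpha = softmax(scores)
        # Contribution of the *other* tokens, routed through attention.
        mixed = [sum(alpha[j] * V[j][t] for j in range(n) if j != i)
                 for t in range(d)]
        # Contribution of the token itself: own attention term + residual.
        preserved = [alpha[i] * V[i][t] + X[i][t] for t in range(d)]
        summed = [m + p for m, p in zip(mixed, preserved)]
        outputs.append(layer_norm(summed))
        ratios.append(norm(mixed) / (norm(mixed) + norm(preserved)))
    return outputs, ratios


def random_matrix(rows, cols, rng, scale):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d, n = 8, 4  # hidden size and sequence length (arbitrary toy values)
X = random_matrix(n, d, rng, 1.0)
Wq, Wk, Wv = (random_matrix(d, d, rng, 0.1) for _ in range(3))

outputs, ratios = attention_block(X, Wq, Wk, Wv)
for i, r in enumerate(ratios):
    print(f"token {i}: mixing ratio = {r:.3f}")
```

Because the weights are random, the printed ratios say nothing about trained models. The sketch only shows how attention weights alone can overstate token-to-token interaction once the residual term is added before normalization.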

Code Implementations: 2 repos