LG · Mar 5, 2021

Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth

arXiv:2103.03404v2 · 542 citations
AI Analysis

This paper addresses a foundational issue for machine learning researchers by revealing a critical limitation of transformer architectures, though the contribution is incremental in that it builds on existing theoretical analyses.

The paper asks why self-attention networks work. It proves that pure attention layers, without skip connections or MLPs, drive the output doubly exponentially with depth toward a rank-1 matrix, so that all tokens become uniform. Experiments confirm this convergence on variants of standard transformer architectures.

Attention-based architectures have become ubiquitous in machine learning, yet our understanding of the reasons for their effectiveness remains limited. This work proposes a new way to understand self-attention networks: we show that their output can be decomposed into a sum of smaller terms, each involving the operation of a sequence of attention heads across layers. Using this decomposition, we prove that self-attention possesses a strong inductive bias towards "token uniformity". Specifically, without skip connections or multi-layer perceptrons (MLPs), the output converges doubly exponentially to a rank-1 matrix. On the other hand, skip connections and MLPs stop the output from degeneration. Our experiments verify the identified convergence phenomena on different variants of standard transformer architectures.
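The collapse described in the abstract can be illustrated numerically. The sketch below is not the authors' code. It assumes a single attention head per layer, random Gaussian weights, and a Frobenius-norm residual to the nearest matrix with identical rows, where the paper uses a different composite norm. The names `attention_layer`, `rank1_residual`, and `run` are hypothetical helpers chosen for this example.

```python
import numpy as np


def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def attention_layer(X, Wq, Wk, Wv):
    # One single-head self-attention layer: no skip connection, no MLP.
    d_qk = Wq.shape[1]
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d_qk)
    return softmax(scores, axis=-1) @ (X @ Wv)


def rank1_residual(X):
    # Relative Frobenius distance from X to the closest matrix whose rows
    # are all identical (1 x^T); the minimizer x is the mean row.
    # Assumption: the paper measures this with another norm.
    res = X - X.mean(axis=0, keepdims=True)
    return np.linalg.norm(res) / np.linalg.norm(X)


def run(depth=6, n_tokens=16, d_model=32, skip=False, seed=0):
    # Stack `depth` attention layers with fresh random weights and record
    # the residual after each layer. `skip=True` adds a skip connection.
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_tokens, d_model))
    history = [rank1_residual(X)]
    for _ in range(depth):
        Wq, Wk, Wv = (
            rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
            for _ in range(3)
        )
        out = attention_layer(X, Wq, Wk, Wv)
        X = X + out if skip else out
        history.append(rank1_residual(X))
    return history


if __name__ == "__main__":
    print("pure attention:", ["%.2e" % r for r in run(skip=False)])
    print("with skip:     ", ["%.2e" % r for r in run(skip=True)])
```

Under these assumptions, the residual of the pure-attention stack should shrink rapidly with depth, while the skip-connected stack should keep a residual that is orders of magnitude larger. MLPs are omitted here, so the sketch only probes the skip-connection half of the claim.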

Code Implementations · 1 repo
