ML | CL | LG | Feb 5, 2024

Towards Understanding the Word Sensitivity of Attention Layers: A Study via Random Features

arXiv:2402.02969v2 | 6 citations | h-index: 25 | ICML
Originality: Incremental advance
AI Analysis

This work provides theoretical insight into a fundamental property of attention mechanisms in transformers, which is relevant to advancing NLP models, though it is incremental in that it builds on existing analysis frameworks.

The paper addresses why attention layers in transformers are effective for NLP tasks by studying their word sensitivity in a random features setting. It shows that attention layers have high word sensitivity, which lets them distinguish sentences that differ in a single word, whereas the word sensitivity of standard random features decays with context length. The results are validated with experiments on BERT-Base embeddings.

Understanding the reasons behind the exceptional success of transformers requires a better analysis of why attention layers are suitable for NLP tasks. In particular, such tasks require predictive models to capture contextual meaning which often depends on one or few words, even if the sentence is long. Our work studies this key property, dubbed word sensitivity (WS), in the prototypical setting of random features. We show that attention layers enjoy high WS, namely, there exists a vector in the space of embeddings that largely perturbs the random attention features map. The argument critically exploits the role of the softmax in the attention layer, highlighting its benefit compared to other activations (e.g., ReLU). In contrast, the WS of standard random features is of order $1/\sqrt{n}$, $n$ being the number of words in the textual sample, and thus it decays with the length of the context. We then translate these results on the word sensitivity into generalization bounds: due to their low WS, random features provably cannot learn to distinguish between two sentences that differ only in a single word; in contrast, due to their high WS, random attention features have higher generalization capabilities. We validate our theoretical results with experimental evidence over the BERT-Base word embeddings of the imdb review dataset.
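The contrast described in the abstract can be illustrated numerically. The following is a minimal sketch under several assumptions that are not taken from the paper: tokens are random Gaussian vectors rescaled to norm $\sqrt{d}$ (not BERT-Base embeddings), the random features map is taken to be ReLU$(V\,\mathrm{vec}(X))$, the random attention features map is taken to be softmax$(XWX^\top/\sqrt{d})\,X$, and the perturbation of a single word is limited to norm $\sqrt{d}$. All function names are hypothetical, and the perturbation directions are heuristics standing in for the supremum that a formal definition of WS would use. The sketch also assumes $d$ is large relative to $n$.

```python
import numpy as np

def softmax_rows(S):
    # Row-wise softmax with max-subtraction for numerical stability.
    E = np.exp(S - S.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def random_tokens(rng, n, d):
    # Assumption: n Gaussian "word embeddings", each rescaled to norm sqrt(d).
    X = rng.standard_normal((n, d))
    return X * np.sqrt(d) / np.linalg.norm(X, axis=1, keepdims=True)

def rescale(delta, d):
    # Assumption: the perturbation budget for one word is norm sqrt(d).
    return delta * np.sqrt(d) / np.linalg.norm(delta)

def relative_change(phi, X, delta, i=0):
    # Relative change of the feature map when word i is shifted by delta.
    Xp = X.copy()
    Xp[i] += delta
    base = phi(X)
    return np.linalg.norm(phi(Xp) - base) / np.linalg.norm(base)

def rf_sensitivity(rng, X, k):
    n, d = X.shape
    # Random features on the flattened sentence: ReLU(V vec(X)).
    V = rng.standard_normal((k, n * d)) / np.sqrt(n * d)
    phi = lambda Z: np.maximum(V @ Z.reshape(-1), 0.0)
    # Heuristic direction: top right singular vector of the block of V that
    # multiplies word 0, which maximizes the change in pre-activations.
    _, _, Vt = np.linalg.svd(V[:, :d], full_matrices=False)
    return relative_change(phi, X, rescale(Vt[0], d))

def attention_sensitivity(rng, X):
    n, d = X.shape
    # Random attention features: softmax(X W X^T / sqrt(d)) X.
    W = rng.standard_normal((d, d)) / np.sqrt(d)
    phi = lambda Z: softmax_rows(Z @ W @ Z.T / np.sqrt(d)) @ Z
    # Heuristic direction: minimum-norm delta that raises every query's score
    # for word 0 by the same amount, i.e. the solution of (X W) delta = 1.
    delta, *_ = np.linalg.lstsq(X @ W, np.ones(n), rcond=None)
    return relative_change(phi, X, rescale(delta, d))

def compare(ns=(4, 16), d=256, k=1024, seed=0):
    rng = np.random.default_rng(seed)
    out = {}
    for n in ns:
        X = random_tokens(rng, n, d)
        out[n] = (rf_sensitivity(rng, X, k), attention_sensitivity(rng, X))
    return out

results = compare()
for n, (rf, att) in results.items():
    print(f"n={n:3d}  random features: {rf:.3f}  random attention: {att:.3f}")
```

In this toy setting the random-features value should shrink as $n$ grows, roughly like $1/\sqrt{n}$, while the attention value stays of order one as long as $d$ is much larger than $n$. The exact numbers depend on the seed and on the heuristic directions, so they should be read as an illustration of the qualitative claim rather than a reproduction of the paper's experiments.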

Code Implementations: 1 repo
