How Does Attention Help? Insights from Random Matrices on Signal Recovery from Sequence Models

arXiv:2605.06826

Predicted impact: top 72% in ML · last 90 days
Originality: Incremental advance

AI Analysis

This work provides theoretical insights into how attention mechanisms aid signal recovery in sequence models, offering a principled understanding for practitioners designing attention-based architectures.

The paper derives exact spectral characterizations for sample covariance matrices from pooled sequence representations in high-dimensional regimes, revealing two BBP-type phase transitions for signal recovery. It shows optimal attention weights are given by the top eigenvector of the positional correlation matrix, and causal self-attention with harmonic weights improves signal recovery over mean pooling when early tokens carry more signal.
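To make the weight comparison concrete, the NumPy sketch below evaluates the signal-to-noise ratio $α/κ$, with $α=w^{\top} R w$ and $κ=\|w\|^2$ in the abstract's notation, for three choices of pooling weights $w$. Everything specific in it is an assumption for illustration, not the paper's construction: `toy_R` builds a hypothetical positional correlation matrix whose signal decays geometrically with position, and `harmonic_weights` assumes that each position attends uniformly over its causal prefix before the outputs are mean-pooled, so token $s$ receives weight $\frac{1}{T}\sum_{t\ge s} 1/t$. The paper's exact harmonic weights may differ.

```python
import numpy as np


def harmonic_weights(T):
    """Pooling weights from averaging uniform causal attention rows.

    ASSUMPTION (not taken from the paper): position t attends uniformly to
    positions 1..t (weight 1/t each) and the T outputs are mean-pooled, so
    token s gets w_s = (1/T) * sum_{t >= s} 1/t, a tail of the harmonic series.
    """
    inv = 1.0 / np.arange(1, T + 1)
    # Reverse cumulative sum yields sum_{t >= s} 1/t for every s.
    return np.cumsum(inv[::-1])[::-1] / T


def snr(w, R):
    """Ratio alpha / kappa = (w^T R w) / ||w||^2."""
    return float(w @ R @ w) / float(w @ w)


def optimal_weights(R):
    """Unit-norm top eigenvector of the symmetric matrix R."""
    _, vecs = np.linalg.eigh(R)  # eigenvalues are returned in ascending order
    v = vecs[:, -1]
    return v if v.sum() >= 0 else -v  # fix the arbitrary sign


def toy_R(T, decay=0.5, rho=0.5):
    """HYPOTHETICAL positional correlation matrix for illustration only.

    Per-position signal strength a_i = decay**i, combined with an AR(1)-style
    correlation rho**|i - j| between positions; the result is positive
    semidefinite.
    """
    idx = np.arange(T)
    a = decay ** idx
    C = rho ** np.abs(idx[:, None] - idx[None, :])
    return a[:, None] * C * a[None, :]


T = 4
R = toy_R(T)
candidates = [
    ("mean pooling", np.full(T, 1.0 / T)),
    ("harmonic", harmonic_weights(T)),
    ("top eigenvector", optimal_weights(R)),
]
for name, w in candidates:
    print(f"{name:>16s}: alpha/kappa = {snr(w, R):.4f}")
```

Because $α/κ$ is a Rayleigh quotient of $R$, the top eigenvector attains its maximum (the largest eigenvalue of $R$) for any symmetric $R$. Whether harmonic weights beat mean pooling depends on $R$; in this toy case they do, because the signal is concentrated on early positions.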

We study the spectral properties of sample covariance matrices constructed from pooled sequence representations, where token embeddings are drawn from a fixed two-class Gaussian mixture table and pooled via (fixed) attention weights. Working in the high-dimensional regime $d,V,N\to\infty$ with $d/V\to δ$ and $d/N\to γ$, we derive exact characterizations of the limiting eigenvalue distribution, outlier eigenvalues, and eigenvector alignment with the hidden signal. The bulk spectrum follows a non-Marchenko–Pastur law given by the free multiplicative convolution $κ(MP_δ\boxtimes MP_γ)$, reflecting the finite vocabulary structure. Signal recovery undergoes two successive BBP-type phase transitions governed by the scalars $δ$, $γ$, $α=w^{\top} R w$, and $κ=\|w\|^2$, where $w$ denotes the attention pooling weights and $R$ the positional correlation matrix. As a consequence of our analysis, the optimal attention weights maximizing the signal-to-noise ratio $α/κ$ are given by the (normalized) top eigenvector of $R$. As a particular case, we show that parameter-free causal self-attention with $τ/d$ score scaling yields deterministic harmonic weights that improve signal recovery over mean pooling whenever early tokens carry more signal. Extensive simulations confirm sharp agreement between theory and finite-dimensional experiments.
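To see why a bulk of the form $MP_δ\boxtimes MP_γ$ is not itself Marchenko–Pastur, one can compare low-order moments. The sketch below does not simulate the paper's pooled-sequence model. It substitutes a standard stand-in: the product of two independent Wishart matrices with aspect ratios $δ$ and $γ$, whose limiting spectrum is the free multiplicative convolution of the two Marchenko–Pastur laws. It assumes that $MP_c$ denotes the unit-variance law with aspect ratio $c$ and sets the prefactor $κ$ to 1. Under these assumptions the first moment is 1 and the second moment is $1+δ+γ$, whereas a plain $MP_γ$ law has second moment $1+γ$.

```python
import numpy as np


def wishart(d, n, rng):
    """d x d sample covariance of n standard Gaussian vectors."""
    G = rng.standard_normal((d, n))
    return G @ G.T / n


def product_moments(d, V, N, seed=0):
    """First two normalized trace moments of A @ B for independent Wisharts.

    A has aspect ratio d/V and B has aspect ratio d/N. The traces of powers
    of A @ B equal those of the symmetric matrix A^{1/2} B A^{1/2}.
    Stand-in model only; this is not the paper's pooled-sequence covariance.
    """
    rng = np.random.default_rng(seed)
    A = wishart(d, V, rng)
    B = wishart(d, N, rng)
    AB = A @ B
    m1 = np.trace(AB) / d
    m2 = np.trace(AB @ AB) / d
    return m1, m2


d, V, N = 200, 400, 800
delta, gamma = d / V, d / N
m1, m2 = product_moments(d, V, N)
print(f"first moment : {m1:.3f} (free-convolution prediction 1)")
print(f"second moment: {m2:.3f} (prediction {1 + delta + gamma:.3f}, "
      f"plain MP_gamma would give {1 + gamma:.3f})")
```

The extra $δ$ in the second moment is the imprint of the second Wishart factor, which plays the role that the abstract attributes to the finite vocabulary.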
