AP · LG · DS · May 11

Quantifying Concentration Phenomena of Mean-Field Transformers in the Low-Temperature Regime

arXiv:2605.10931 · 13.71 citations
Predicted impact: top 6% in AP · last 90 days
Originality: Incremental advance
AI Analysis

This paper provides a rigorous theoretical account of token evolution in transformers, relevant to researchers studying transformer dynamics and mean-field limits.

The paper proves that in deep encoder-only transformers at inference time, the token distribution rapidly concentrates onto a limiting distribution in the low-temperature regime, with the Wasserstein distance between the two scaling like $\sqrt{\log(\beta+1)/\beta}\,\exp(Ct)+\exp(-ct)$. Numerical experiments confirm this concentration for time scales of order $\log\beta$ and reveal a terminal phase, at finite $\beta$ and large $t$, dominated by the spectrum of the value matrix.
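To see where the $\log\beta$ time scale comes from, one can substitute $t = a\log\beta$ into the stated bound. The following is a sketch under the assumption that $C, c > 0$ are constants independent of $\beta$ and $t$ (their values are not given here); the window $a < 1/(2C)$ is read off from the bound itself, not quoted from the paper.

```latex
\[
  t = a\log\beta, \qquad 0 < a < \frac{1}{2C}
  \quad\Longrightarrow\quad
  \sqrt{\frac{\log(\beta+1)}{\beta}}\, e^{Ct} + e^{-ct}
  \;=\; \sqrt{\log(\beta+1)}\;\beta^{\,Ca-\frac{1}{2}} \;+\; \beta^{-ca}
  \;\longrightarrow\; 0
  \qquad (\beta\to\infty).
\]
```

Both terms vanish because $Ca - \tfrac12 < 0$, so the power of $\beta$ outweighs the $\sqrt{\log(\beta+1)}$ factor, and $ca > 0$. For $a \geq 1/(2C)$ the first term no longer decays, which is consistent with the bound being informative only up to times of order $\log\beta$.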

Transformers with self-attention modules as their core components have become an integral architecture in modern large language and foundation models. In this paper, we study the evolution of tokens in deep encoder-only transformers at inference time, which is described in the large-token limit by a mean-field continuity equation. Leveraging ideas from the convergence analysis of interacting multi-particle systems, with particles corresponding to tokens, we prove that the token distribution rapidly concentrates onto the push-forward of the initial distribution under a projection map induced by the key, query, and value matrices, and remains metastable for moderate times. Specifically, we show that the Wasserstein distance of the two distributions scales like $\sqrt{\log(\beta+1)/\beta}\,\exp(Ct)+\exp(-ct)$ in terms of the temperature parameter $\beta^{-1}\to 0$ and inference time $t\geq 0$. For the proof, we establish Lyapunov-type estimates for the zero-temperature equation, identify its limit as $t\to\infty$, and employ a stability estimate in Wasserstein space together with a quantitative Laplace principle to couple the two equations. Our result implies that for time scales of order $\log\beta$ the token distribution concentrates at the identified limiting distribution. Numerical experiments confirm this and, beyond that, complement our theory by showing that for finite $\beta$ and large $t$ the dynamics enter a different terminal phase, dominated by the spectrum of the value matrix.
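The role of the Laplace principle can be illustrated with a small numerical sketch: as $\beta$ grows, softmax attention weights concentrate on the highest-scoring token, so the attention-weighted mean approaches its zero-temperature value. This is only an elementary illustration, not the paper's quantitative estimate; the four tokens, the query vector, and the function name `softmax_weighted_mean` are hypothetical choices made for the example.

```python
import numpy as np

def softmax_weighted_mean(points, scores, beta):
    """Mean of `points` under weights proportional to exp(beta * scores)."""
    shifted = beta * (scores - scores.max())  # subtract the max for numerical stability
    weights = np.exp(shifted)
    weights /= weights.sum()
    return weights @ points

# Hypothetical toy data: four tokens in R^2 and one query direction.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 1.0]])
query = np.array([1.0, 1.0])
scores = points @ query                  # attention scores <query, x_j>
target = points[np.argmax(scores)]       # zero-temperature limit: the top-scoring token

betas = [0.1, 1.0, 10.0]
errors = [
    float(np.linalg.norm(softmax_weighted_mean(points, scores, b) - target))
    for b in betas
]

for b, e in zip(betas, errors):
    print(f"beta = {b:5.1f}   distance to zero-temperature limit = {e:.3e}")
```

In this toy setting the distance to the top-scoring token shrinks as $\beta$ increases, which is the qualitative mechanism that lets the finite-temperature dynamics be compared with the zero-temperature equation.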

