LG | MATH-PH | DS | PR | Dec 1, 2025

The Mean-Field Dynamics of Transformers

arXiv:2512.01868v1 | 19 citations | h-index: 6
Originality: Highly original
AI Analysis

This paper provides foundational insights into representation collapse and expressive structure in deep attention architectures.

The authors model Transformer attention as an interacting particle system and analyze its mean-field limits. The analysis reveals a global clustering phenomenon: tokens cluster asymptotically after passing through long-lived metastable states in which they are arranged into multiple clusters. The paper also derives exact clustering rates in an equiangular reduction and identifies a phase transition for long-context attention.

We develop a mathematical framework that interprets Transformer attention as an interacting particle system and studies its continuum (mean-field) limits. By idealizing attention as a continuous dynamics on the sphere, we connect Transformer dynamics to Wasserstein gradient flows, synchronization models (Kuramoto), and mean-shift clustering. Central to our results is a global clustering phenomenon whereby tokens cluster asymptotically after long metastable states where they are arranged into multiple clusters. We further analyze a tractable equiangular reduction to obtain exact clustering rates, show how commonly used normalization schemes alter contraction speeds, and identify a phase transition for long-context attention. The results highlight both the mechanisms that drive representation collapse and the regimes that preserve expressive, multi-cluster structure in deep attention architectures.
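
The particle-system view in the abstract can be illustrated with a small simulation. The sketch below is not taken from the paper. It assumes a simplified model in which each token x_i lives on the unit sphere and moves along the softmax-weighted average of all tokens, projected onto the tangent space at x_i, with identity query, key, and value matrices. The inverse temperature `beta`, the explicit Euler step `dt`, and the function names are illustrative choices.

```python
import numpy as np

def attention_velocity(X, beta=1.0):
    """Assumed drift: softmax attention average, projected onto the tangent space at each token."""
    G = X @ X.T                                   # pairwise inner products <x_i, x_j>
    W = np.exp(beta * (G - G.max(axis=1, keepdims=True)))  # shift for numerical stability
    W /= W.sum(axis=1, keepdims=True)             # row-wise softmax weights
    V = W @ X                                     # attention output for each token
    # Remove the radial component so that the flow stays tangent to the sphere.
    return V - np.sum(V * X, axis=1, keepdims=True) * X

def simulate(n=32, d=3, beta=1.0, dt=0.1, steps=2000, seed=0):
    """Explicit Euler steps followed by renormalization (hypothetical discretization)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # initial tokens on the unit sphere
    min_cos = []
    for _ in range(steps):
        X = X + dt * attention_velocity(X, beta)
        X /= np.linalg.norm(X, axis=1, keepdims=True)
        min_cos.append(float((X @ X.T).min()))     # smallest pairwise cosine similarity
    return X, min_cos

if __name__ == "__main__":
    X, min_cos = simulate()
    print("min pairwise cosine, first step:", min_cos[0])
    print("min pairwise cosine, last step: ", min_cos[-1])
```

Under these assumptions, the smallest pairwise cosine similarity serves as a rough clustering diagnostic: values near 1 mean all tokens have merged into one cluster. Raising `beta` is the natural way to probe for longer-lived multi-cluster configurations, though how closely this toy discretization tracks the paper's continuous-time results is not established here.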

