LG · AI · May 29

Reachability and asymptotics of Gaussian Transformer dynamics

arXiv:2606.07600 · h-index: 6
Originality: Incremental advance
AI Analysis

This work contributes to the theoretical understanding of Transformers by providing a mathematical framework that links them to control theory and Riccati equations, though its restriction to Gaussian distributions limits immediate practical impact.

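The paper proves that Gaussian distributions remain Gaussian through the layers of a mean-field Transformer, reducing the dynamics to a finite-dimensional bilinear control system for the mean and covariance. With time-varying controls, it shows exact finite-time reachability of any target Gaussian whose covariance has the same rank as the initial one. With time-invariant parameters, it identifies spectral conditions for asymptotic stability or finite-time blow-up of the covariance. Numerical experiments support the theory.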

We formulate data propagation through the Transformer, the machine learning architecture powering large language models, as a nonlinear control system on the space of probability measures. For the mean-field Transformer model with self-attention and affine feed-forward layers, we prove that Gaussian distributions remain exactly Gaussian along the induced flow. This invariance reduces the infinite-dimensional measure dynamics to a finite-dimensional bilinear control system governing the evolution of the mean and covariance, reformulates the expressive capacity of Transformers as a reachability problem for prescribed Gaussian moments, and reveals a novel connection with Riccati-type equations from classical filtering and control. For time-varying controls, we prove exact finite-time reachability of any target Gaussian distribution whose covariance matrix has the same rank as the initial one, this rank constraint being an intrinsic invariant of the dynamics. For time-invariant parameters, we derive explicit spectral conditions leading either to asymptotic stability toward positive-definite equilibria or to finite-time blow-up of the covariance. Numerical experiments complement the theory by showing that practical Transformers with Gaussian inputs remain close to moment-matched Gaussian distributions through early and intermediate layers, while Transformers with prescribed attention matrices reproduce the predicted covariance regimes: bounded evolution in stabilizing configurations and blow-up in destabilizing ones.
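The rank invariance and the stability versus blow-up dichotomy mentioned in the abstract can be illustrated with a toy computation. The sketch below is not the paper's system. It assumes a generic covariance flow of the form dΣ/dt = M(Σ)Σ + ΣM(Σ)ᵀ with hypothetical drifts M chosen only for illustration, and the helpers `evolve` and `rank` are names introduced here. Each discrete step is a congruence Σ ← (I + hM)Σ(I + hM)ᵀ by an invertible matrix, which is one standard mechanism by which rank is preserved. In one dimension the same form gives a Riccati-type scalar equation that either settles at a positive equilibrium or blows up in finite time.

```python
import numpy as np

def evolve(sigma0, drift, h=1e-3, steps=1000, cap=1e8):
    """First-order scheme for dS/dt = M(S) S + S M(S)^T (assumed toy flow).

    Each step applies S <- (I + h M) S (I + h M)^T. For small h the factor
    is invertible, so the step is a congruence and preserves rank.
    Returns (S, t_blow), where t_blow is the first time trace(S) > cap,
    or None if the trace stayed below cap for all steps.
    """
    sigma = np.array(sigma0, dtype=float)
    eye = np.eye(sigma.shape[0])
    for k in range(steps):
        step = eye + h * drift(sigma)
        sigma = step @ sigma @ step.T
        if np.trace(sigma) > cap:
            return sigma, (k + 1) * h
    return sigma, None

def rank(mat):
    """Numerical rank with a tolerance relative to the spectral norm."""
    return int(np.linalg.matrix_rank(mat, tol=1e-8 * np.linalg.norm(mat, 2)))

# 1) Rank invariance: a rank-2 covariance in R^3 under a constant drift.
L = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
sigma0 = L @ L.T  # rank 2 by construction
A = np.array([[0.0, 1.0, 0.0], [-1.0, 0.0, 0.0], [0.0, 0.0, -0.5]])
sigma1, _ = evolve(sigma0, lambda s: A)
rank_before, rank_after = rank(sigma0), rank(sigma1)

# 2) Scalar Riccati-type flow ds/dt = 2 (a + b s) s.
# Stabilizing toy choice a = 1, b = -1: equilibrium at s = 1.
s_stable, t_stable = evolve([[0.2]], lambda s: np.array([[1.0 - s[0, 0]]]),
                            h=1e-2, steps=2000)
# Destabilizing toy choice a = 0, b = 1: ds/dt = 2 s^2, which for s(0) = 1
# has the exact solution s(t) = 1 / (1 - 2 t), diverging at t = 0.5.
s_blow, t_blow = evolve([[1.0]], lambda s: np.array([[s[0, 0]]]))

print("rank before / after:", rank_before, rank_after)
print("stable limit:", s_stable[0, 0], "blow-up detected:", t_stable)
print("numerical blow-up time:", t_blow)
```

In this toy setting the rank stays at 2, the stabilizing case converges to the equilibrium, and the destabilizing case is flagged shortly after the analytical blow-up time of 0.5. The paper's actual spectral conditions concern the Transformer's attention and feed-forward parameters and are not reproduced here.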

