Training Infinitely Deep and Wide Transformers

Raphaël Barboni, Maarten V. de Hoop, Takashi Furuya, Gabriel Peyré

arXiv:2605.17660

AI Analysis

Provides a theoretical foundation for transformer training dynamics, addressing a gap in understanding for practitioners and theorists.

This paper develops a rigorous mean-field framework for analyzing gradient-based training of infinitely deep and wide transformers, establishing well-posedness, gradient flow convergence, and NTK injectivity conditions. Under NTK injectivity, gradient flow converges to global minima when initial loss is small.

Transformers have become the dominant architecture in modern machine learning, yet the theoretical understanding of their training dynamics remains limited. This paper develops a rigorous mathematical framework for analyzing gradient-based training of transformers in the mean-field regime, where both the depth (number of layers) and width (number of attention heads) tend to infinity. While ResNet training can be understood as controlling a neural ODE, transformer training corresponds to controlling a neural PDE, due to the coupling of multiple token distributions through the attention mechanism. Our mean-field model features two types of measure representations: token distributions evolving through layers and attention parameters at each layer. We establish well-posedness of the forward pass through infinitely deep transformers, characterizing token evolution via flow maps that satisfy ODEs in function spaces. Using adjoint sensitivity analysis, we derive an explicit formula for the conditional Wasserstein gradient of the training risk, involving adjoint variables governed by backward ODEs. We prove the existence and uniqueness of gradient flow curves in the conditional Wasserstein metric space, establishing a rigorous foundation for gradient-based transformer training. A key technical contribution is providing necessary and sufficient conditions for injectivity of the Neural Tangent Kernel (NTK) for attention mechanisms: we show that NTK injectivity is equivalent to linear independence of log-sum-exp functions modulo affine functions, a condition satisfied by diverse token distributions, including discrete distributions, uniform distributions, and Gaussian mixtures. Under this NTK injectivity assumption, we prove that gradient flow converges to global minima when the initial loss is sufficiently small, eliminating spurious local minima from the optimization landscape.
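To make the depth and width limits concrete, here is a minimal sketch of a finite transformer whose forward pass is an explicit Euler step in depth, with each layer averaging over its attention heads. It is an illustration under stated assumptions, not the paper's construction: the head parametrization by a pair of matrices `(A, V)`, the `1/L` depth scaling, the `1/H` head averaging, and the function names `attention_head`, `forward`, and `random_layers` are all hypothetical choices made here. Letting the number of layers and heads grow is the regime the abstract refers to as infinitely deep and wide.

```python
import math
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attention_head(tokens, A, V):
    """Velocity from one head: for each token x_i, the softmax-weighted
    average of V x_j with scores <x_i, A x_j> (assumed parametrization)."""
    keys = [matvec(A, x) for x in tokens]
    values = [matvec(V, x) for x in tokens]
    dim = len(tokens[0])
    out = []
    for x in tokens:
        scores = [dot(x, k) for k in keys]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * val[c] for w, val in zip(weights, values)) / total
            for c in range(dim)
        ])
    return out


def forward(tokens, layers):
    """Explicit Euler in depth: x_i <- x_i + (1/L) * mean over heads.

    `layers` is a list of L layers; each layer is a list of H heads (A, V).
    The 1/L and 1/H factors are the scalings assumed for this sketch.
    """
    depth = len(layers)
    dim = len(tokens[0])
    for heads in layers:
        velocity = [[0.0] * dim for _ in tokens]
        for A, V in heads:
            contrib = attention_head(tokens, A, V)
            for i, c in enumerate(contrib):
                velocity[i] = [a + b / len(heads) for a, b in zip(velocity[i], c)]
        tokens = [
            [x + v / depth for x, v in zip(tok, vel)]
            for tok, vel in zip(tokens, velocity)
        ]
    return tokens


def random_layers(depth, width, dim, rng):
    """Sample Gaussian head parameters (an arbitrary initialization choice)."""
    def mat():
        return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(dim)]
    return [[(mat(), mat()) for _ in range(width)] for _ in range(depth)]


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(5)]
    print(forward(tokens, random_layers(depth=16, width=8, dim=2, rng=rng)))
```

In this sketch every token's update depends on all the other tokens through the softmax weights. That coupling is the feature the abstract points to when it contrasts transformer training (a neural PDE over token distributions) with ResNet training (a neural ODE).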
