77.4LGMay 16
Propagation of Chaos in Contextual Flow MapsShi Chen, Zhengjiang Lin, Kaizhao Liu et al.
We develop a quantitative statistical theory of transformers in the large-context regime by adopting the abstraction of contextual flow maps (CFMs): dynamical systems that evolve a distinguished token in the presence of a contextual measure across a stack of attention blocks. Within this framework, the finite-context model approximates an idealized infinite-context system in which the contextual measure is replaced by its underlying population, so that the context length $n$ becomes a statistical resource. Exploiting the McKean--Vlasov structure of the dynamics and the classical machinery of propagation of chaos, we establish a forward bound controlling the deviation between the finite- and infinite-context CFMs uniformly along depth, and a backward bound controlling the deviation between the corresponding training trajectories uniformly across iterations of online gradient descent. Both bounds achieve the optimal Wasserstein rate $n^{-1/d}$ for general CFMs and parametric rate $n^{-1/2}$ for a restricted class of CFMs that includes transformers as a special case. The analysis rests on a new Eulerian adjoint formulation of the loss gradient and stability estimates for the resulting forward--adjoint system, both of which may be of independent interest.
74.1LGMay 8
Scaling Limits of Long-Context TransformersGiuseppe Bruno, Shi Chen, Zhengjiang Lin et al.
We study the long-context limit of softmax self-attention with a fixed query and a random context of $n$ i.i.d. keys on the sphere, viewing the inverse temperature $β_n$ as the scaling parameter that decides whether attention degenerates into uniform averaging or collapses onto the single closest key. We show that the critical scale at which selectivity emerges is determined by the local exponent of the distance-to-query distribution near zero rather than by global features of the context, and scales like $β_n^\ast \asymp n^{2/(d-1)}$ for uniform keys on $\mathbb{S}^{d-1}$. Furthermore, we characterize the limiting laws of the ordered attention weights and of the attention output across all regimes of $β_n$: a subcritical regime in which the output reduces to a local average around $q$ with explicit deterministic bias and Gaussian fluctuations; a critical regime in which a finite collection of nearest keys retains macroscopic mass without single-key collapse; and a supercritical regime in which all mass concentrates on the closest key. Of notable interest is the subcritical case with identity value matrix where the attention map approximately implements a backward heat equation.
LGApr 15, 2024
Convergence Analysis of Probability Flow ODE for Score-based Generative ModelsDaniel Zhengyu Huang, Jiaoyang Huang, Zhengjiang Lin
Score-based generative models have emerged as a powerful approach for sampling high-dimensional probability distributions. Despite their effectiveness, their theoretical underpinnings remain relatively underdeveloped. In this work, we study the convergence properties of deterministic samplers based on probability flow ODEs from both theoretical and numerical perspectives. Assuming access to $L^2$-accurate estimates of the score function, we prove the total variation between the target and the generated data distributions can be bounded above by $\mathcal{O}(d^{3/4}δ^{1/2})$ in the continuous time level, where $d$ denotes the data dimension and $δ$ represents the $L^2$-score matching error. For practical implementations using a $p$-th order Runge-Kutta integrator with step size $h$, we establish error bounds of $\mathcal{O}(d^{3/4}δ^{1/2} + d\cdot(dh)^p)$ at the discrete level. Finally, we present numerical studies on problems up to 128 dimensions to verify our theory.
LGApr 20, 2025
Quantitative Clustering in Mean-Field Transformer ModelsShi Chen, Zhengjiang Lin, Yury Polyanskiy et al.
The evolution of tokens through a deep transformer models can be modeled as an interacting particle system that has been shown to exhibit an asymptotic clustering behavior akin to the synchronization phenomenon in Kuramoto models. In this work, we investigate the long-time clustering of mean-field transformer models. More precisely, we establish exponential rates of contraction to a Dirac point mass for any suitably regular initialization under some assumptions on the parameters of transformer models, any suitably regular mean-field initialization synchronizes exponentially fast with some quantitative rates.
LGJun 16, 2025
Fast Convergence for High-Order ODE Solvers in Diffusion Probabilistic ModelsDaniel Zhengyu Huang, Jiaoyang Huang, Zhengjiang Lin
Diffusion probabilistic models generate samples by learning to reverse a noise-injection process that transforms data into noise. A key development is the reformulation of the reverse sampling process as a deterministic probability flow ordinary differential equation (ODE), which allows for efficient sampling using high-order numerical solvers. Unlike traditional time integrator analysis, the accuracy of this sampling procedure depends not only on numerical integration errors but also on the approximation quality and regularity of the learned score function, as well as their interaction. In this work, we present a rigorous convergence analysis of deterministic samplers derived from probability flow ODEs for general forward processes with arbitrary variance schedules. Specifically, we develop and analyze $p$-th order (exponential) Runge-Kutta schemes, under the practical assumption that the first and second derivatives of the learned score function are bounded. We prove that the total variation distance between the generated and target distributions can be bounded as \begin{align*} O\bigl(d^{\frac{7}{4}}\varepsilon_{\text{score}}^{\frac{1}{2}} +d(dH_{\max})^p\bigr), \end{align*} where $\varepsilon^2_{\text{score}}$ denotes the $L^2$ error in the score function approximation, $d$ is the data dimension, and $H_{\max}$ represents the maximum solver step size. Numerical experiments on benchmark datasets further confirm that the derivatives of the learned score function are bounded in practice.
LGJan 1, 2025
Residual connections provably mitigate oversmoothing in graph neural networksZiang Chen, Zhengjiang Lin, Shi Chen et al.
Graph neural networks (GNNs) have achieved remarkable empirical success in processing and representing graph-structured data across various domains. However, a significant challenge known as "oversmoothing" persists, where vertex features become nearly indistinguishable in deep GNNs, severely restricting their expressive power and practical utility. In this work, we analyze the asymptotic oversmoothing rates of deep GNNs with and without residual connections by deriving explicit convergence rates for a normalized vertex similarity measure. Our analytical framework is grounded in the multiplicative ergodic theorem. Furthermore, we demonstrate that adding residual connections effectively mitigates or prevents oversmoothing across several broad families of parameter distributions. The theoretical findings are strongly supported by numerical experiments.
LGOct 7, 2025
Critical attention scaling in long-context transformersShi Chen, Zhengjiang Lin, Yury Polyanskiy et al.
As large language models scale to longer contexts, attention layers suffer from a fundamental pathology: attention scores collapse toward uniformity as context length $n$ increases, causing tokens to cluster excessively, a phenomenon known as rank-collapse. While $\textit{attention scaling}$ effectively addresses this deficiency by rescaling attention scores with a polylogarithmic factor $β_n$, theoretical justification for this approach remains lacking. We analyze a simplified yet tractable model that magnifies the effect of attention scaling. In this model, attention exhibits a phase transition governed by the scaling factor $β_n$: insufficient scaling collapses all tokens to a single direction, while excessive scaling reduces attention to identity, thereby eliminating meaningful interactions between tokens. Our main result identifies the critical scaling $β_n \asymp \log n$ and provides a rigorous justification for attention scaling in YaRN and Qwen, clarifying why logarithmic scaling maintains sparse, content-adaptive attention at large context lengths.