LG · ML · May 25, 2025

Temperature is All You Need for Generalization in Langevin Dynamics and other Markov Processes

arXiv:2505.19087v2 · 3 citations · h-index: 17
AI Analysis

The result provides a foundational generalization guarantee for stochastic training algorithms and applies broadly across machine learning, though it is incremental in that it extends existing thermodynamic principles to generalization analysis.

The paper tackles the generalization gap in over-parametrized models trained with Markovian stochastic algorithms like Langevin dynamics, bounding it by √((β𝔼L(θ₀) + log(1/δ))/N) with probability 1-δ, independent of training time, mixing, dimensionality, or loss properties.
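To get a feel for the scale of this bound, the expression can be evaluated directly. The following is a minimal sketch, not the paper's code: the function name and argument names are hypothetical, and it computes only the expression as written in the summary, ignoring any constant factors the full theorem may carry.

```python
import math


def generalization_gap_bound(beta, expected_init_loss, n_samples, delta):
    """Evaluate sqrt((beta * E[L(theta_0)] + log(1/delta)) / N).

    beta               -- inverse temperature (noise variance is 1/beta)
    expected_init_loss -- E[L(theta_0)], the expected loss at initialization
    n_samples          -- N, the training sample size
    delta              -- failure probability; the bound holds w.p. 1 - delta

    Constant factors are omitted (an assumption of this sketch).
    """
    if beta < 0 or expected_init_loss < 0:
        raise ValueError("beta and expected_init_loss must be non-negative")
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    return math.sqrt(
        (beta * expected_init_loss + math.log(1.0 / delta)) / n_samples
    )


# Illustrative values only: beta = 100, E[L(theta_0)] = 1, N = 10,000, delta = 0.05.
example_bound = generalization_gap_bound(100.0, 1.0, 10_000, 0.05)
```

Note that nothing in the expression involves the parameter count or the number of training steps: the bound tightens only by lowering β (raising the temperature) or by increasing N.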

We analyze the generalization gap (the gap between the training and test errors) when training a potentially over-parametrized model using a Markovian stochastic training algorithm, initialized from some distribution $θ_0 \sim p_0$. We focus on Langevin dynamics with a positive temperature $β^{-1}$, i.e. gradient descent on a training loss $L$ with infinitesimal step size, perturbed with $β^{-1}$-variance Gaussian noise, and lightly regularized or bounded. There, we bound the generalization gap, at any time during training, by $\sqrt{(β\mathbb{E} L(θ_0) + \log(1/δ))/N}$ with probability $1-δ$ over the dataset, where $N$ is the sample size, and $\mathbb{E} L(θ_0) = O(1)$ with standard initialization scaling. In contrast to previous guarantees, our bound depends neither on training time nor on mixing, and it has no dependence on dimensionality, gradient norms, or any other properties of the loss or model. This guarantee follows from a general analysis of any Markov process-based training that has a Gibbs-style stationary distribution. The proof is surprisingly simple, once we observe that the divergence of the marginal distribution from initialization remains bounded, as implied by a generalized second law of thermodynamics.
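The dynamics described in the abstract can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the authors' implementation: it uses a finite-step Euler-Maruyama discretization with the conventional noise scale $\sqrt{2η/β}$ for step size $η$ (the paper's exact normalization may differ), it omits the light regularization or bounding the abstract mentions, and the toy quadratic loss and all function names are hypothetical. Because the abstract states the guarantee for infinitesimal step size, this discretized sketch should not be assumed to inherit it.

```python
import math
import random


def langevin_step(theta, grad_fn, beta, step_size, rng):
    """One noisy gradient step: theta - eta * grad + sqrt(2 * eta / beta) * xi."""
    grad = grad_fn(theta)
    # Assumed convention: per-step noise standard deviation sqrt(2 * eta / beta).
    noise_scale = math.sqrt(2.0 * step_size / beta)
    return [
        t - step_size * g + noise_scale * rng.gauss(0.0, 1.0)
        for t, g in zip(theta, grad)
    ]


def run_langevin(theta0, grad_fn, beta, step_size, n_steps, seed=0):
    """Iterate langevin_step n_steps times from theta0 with a seeded RNG."""
    rng = random.Random(seed)
    theta = list(theta0)
    for _ in range(n_steps):
        theta = langevin_step(theta, grad_fn, beta, step_size, rng)
    return theta


def quadratic_grad(theta):
    """Gradient of the toy loss L(theta) = 0.5 * ||theta||^2 (hypothetical)."""
    return list(theta)


# Large beta means low temperature: the noise is small and the iterates
# behave almost like plain gradient descent on the toy loss.
final_theta = run_langevin([1.0, -2.0], quadratic_grad, beta=1e12,
                           step_size=0.1, n_steps=200)
```

Lowering `beta` injects more noise per step, which is the knob the bound rewards: a smaller β gives a smaller guaranteed gap, at the cost of a noisier fit to the training loss.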

