NA | May 4, 2012
Rational Construction of Stochastic Numerical Methods for Molecular Sampling
Benedict Leimkuhler, Charles Matthews
In this article, we focus on the sampling of the configurational Gibbs-Boltzmann distribution, that is, the calculation of averages of functions of the position coordinates of a molecular $N$-body system modelled at constant temperature. We show how a formal series expansion of the invariant measure of a Langevin dynamics numerical method can be obtained in a straightforward way using the Baker-Campbell-Hausdorff lemma. We then compare Langevin dynamics integrators in terms of their invariant distributions and demonstrate a superconvergence property (4th order accuracy where only 2nd order would be expected) of one method in the high friction limit; this method, moreover, can be reduced to a simple modification of the Euler-Maruyama method for Brownian dynamics involving a non-Markovian (coloured noise) random process. In the Brownian dynamics case, 2nd order accuracy of the invariant density is achieved. All methods considered are efficient for molecular applications (requiring one force evaluation per timestep) and of a simple form. In fully resolved (long run) molecular dynamics simulations, for our favoured method, we observe up to two orders of magnitude improvement in configurational sampling accuracy for given stepsize with no evident reduction in the size of the largest usable timestep compared to common alternative methods.
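The modified Euler-Maruyama scheme mentioned in this abstract can be illustrated with a short sketch. It assumes the commonly quoted form of the update, $x_{n+1} = x_n - h\nabla U(x_n) + \sqrt{2 k_B T h}\,(R_n + R_{n+1})/2$, with unit mass and friction; the function names, the one-dimensional setting, and the harmonic test potential are hypothetical choices made for illustration and are not taken from the paper.

```python
import math
import random


def lm_sample(grad_u, x0, h, n_steps, kT=1.0, seed=0):
    """Sketch of the averaged-noise (coloured noise) Brownian dynamics update.

    Each step reuses the previous Gaussian draw, so successive noise
    increments are correlated: the noise term is (R_n + R_{n+1}) / 2.
    """
    rng = random.Random(seed)
    x = x0
    r_old = rng.gauss(0.0, 1.0)
    scale = math.sqrt(2.0 * kT * h)
    samples = []
    for _ in range(n_steps):
        r_new = rng.gauss(0.0, 1.0)
        x = x - h * grad_u(x) + scale * 0.5 * (r_old + r_new)
        r_old = r_new  # carry the draw over to the next step
        samples.append(x)
    return samples


def em_sample(grad_u, x0, h, n_steps, kT=1.0, seed=0):
    """Plain Euler-Maruyama for comparison: fresh, independent noise each step."""
    rng = random.Random(seed)
    x = x0
    scale = math.sqrt(2.0 * kT * h)
    samples = []
    for _ in range(n_steps):
        x = x - h * grad_u(x) + scale * rng.gauss(0.0, 1.0)
        samples.append(x)
    return samples


def second_moment(samples):
    """Time average of x^2, used here as a simple configurational observable."""
    return sum(v * v for v in samples) / len(samples)
```

For the toy potential $U(x) = x^2/2$ at $k_B T = 1$ the exact value of $\langle x^2 \rangle$ is 1. Calling `second_moment(lm_sample(lambda x: x, 0.0, 0.5, 200000))` should land close to that value even at the large stepsize $h = 0.5$, while the same call with `em_sample` is biased upward (the stationary variance of Euler-Maruyama on this potential is $1/(1 - h/2)$).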
CHEM-PH | Feb 7, 2019
Simulated Tempering Method in the Infinite Switch Limit with Adaptive Weight Learning
Anton Martinsson, Jianfeng Lu, Benedict Leimkuhler et al.
We investigate the theoretical foundations of the simulated tempering method and use our findings to design efficient algorithms. Employing a large deviation argument first used for replica exchange molecular dynamics [Plattner et al., J. Chem. Phys. 135:134111 (2011)], we demonstrate that the most efficient approach to simulated tempering is to vary the temperature infinitely rapidly. In this limit, we can replace the equations of motion for the temperature and physical variables by averaged equations for the latter alone, with the forces rescaled according to a position-dependent function defined in terms of temperature weights. The averaged equations are similar to those used in Gao's integrated-over-temperature method, except that we show that it is better to use a continuous rather than a discrete set of temperatures. We give a theoretical argument for the choice of the temperature weights as the reciprocal partition function, thereby relating simulated tempering to Wang-Landau sampling. Finally, we describe a self-consistent algorithm for simultaneously sampling the canonical ensemble and learning the weights during simulation. This algorithm is tested on a system of harmonic oscillators as well as a continuous variant of the Curie-Weiss model, where it is shown to perform well and to accurately capture the second-order phase transition observed in this model.
NA | Mar 6, 2016
Adaptive Thermostats for Noisy Gradient Systems
Benedict Leimkuhler, Xiaocheng Shang
We study numerical methods for sampling probability measures in high dimension where the underlying model is only approximately identified with a gradient system. Extended stochastic dynamical methods are discussed which have application to multiscale models, nonequilibrium molecular dynamics, and Bayesian sampling techniques arising in emerging machine learning applications. In addition to providing a more comprehensive discussion of the foundations of these methods, we propose a new numerical method for the adaptive Langevin/stochastic gradient Nosé-Hoover thermostat that achieves a dramatic improvement in numerical efficiency over the most popular stochastic gradient methods reported in the literature. We also demonstrate that the newly established method inherits a superconvergence property (fourth order convergence to the invariant measure for configurational quantities) recently demonstrated in the setting of Langevin dynamics. Our findings are verified by numerical experiments.
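To make the idea of an adaptive Langevin (stochastic gradient Nosé-Hoover) thermostat concrete, here is a minimal one-dimensional sketch. It assumes the standard continuous-time form $dq = p\,dt$, $dp = -\nabla U(q)\,dt + \sigma\,dW - \xi p\,dt$, $d\xi = \mu^{-1}(p^2 - k_B T)\,dt$ and uses a plain first-order update; it is not the new numerical method proposed in the paper, and the function name, parameters, and test potential are hypothetical.

```python
import math
import random


def adaptive_langevin(grad_u, q0, h, n_steps, sigma, kT=1.0, mu=1.0, seed=0):
    """First-order sketch of an adaptive Langevin thermostat in one dimension.

    sigma models gradient noise of unknown strength. The auxiliary variable xi
    acts as a friction that is driven up when p^2 exceeds kT and down when it
    falls below, so the sampler does not need to be told sigma.
    Returns the time averages of p^2 and of xi.
    """
    rng = random.Random(seed)
    q, p, xi = q0, 0.0, 0.0
    noise_scale = sigma * math.sqrt(h)
    kin_sum = 0.0
    xi_sum = 0.0
    for _ in range(n_steps):
        # Momentum: noisy force plus thermostat friction (old xi).
        p += -h * grad_u(q) + noise_scale * rng.gauss(0.0, 1.0) - h * xi * p
        # Position: uses the updated momentum.
        q += h * p
        # Thermostat: negative feedback on the kinetic temperature.
        xi += h * (p * p - kT) / mu
        kin_sum += p * p
        xi_sum += xi
    return kin_sum / n_steps, xi_sum / n_steps
```

With `grad_u = lambda q: q`, `h = 0.01`, and `sigma = math.sqrt(2.0)`, the time average of $p^2$ should sit close to $k_B T = 1$. In the continuous model the mean of $\xi$ is expected to settle near $\sigma^2 / (2 k_B T)$, which is how the thermostat compensates for noise it was never told about.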
LG | Jan 30
Adaptive Momentum and Nonlinear Damping for Neural Network Training
Aikaterini Karoni, Rajit Rajpal, Benedict Leimkuhler et al.
We propose a continuous-time scheme for large-scale optimization that introduces individual, adaptive momentum coefficients regulated by the kinetic energy of each model parameter. This approach automatically adjusts to local landscape curvature to maintain stability without sacrificing convergence speed. We demonstrate that our adaptive friction can be related to cubic damping, a suppression mechanism from structural dynamics. Furthermore, we introduce two specific optimization schemes by augmenting the continuous dynamics of mSGD and Adam with a cubic damping term. Empirically, our methods demonstrate robustness and match or outperform Adam on training ViT, BERT, and GPT2 tasks where mSGD typically struggles. We further provide theoretical results establishing the exponential convergence of the proposed schemes.
LG | Nov 11, 2025
Adaptive Stepsizing for Stochastic Gradient Langevin Dynamics in Bayesian Neural Networks
Rajit Rajpal, Benedict Leimkuhler, Yuanhao Jiang
Bayesian neural networks (BNNs) require scalable sampling algorithms to approximate posterior distributions over parameters. Existing stochastic gradient Markov Chain Monte Carlo (SGMCMC) methods are highly sensitive to the choice of stepsize, and adaptive variants such as pSGLD typically fail to sample the correct invariant measure without the addition of a costly divergence correction term. In this work, we build on the recently proposed 'SamAdams' framework for timestep adaptation (Leimkuhler, Lohmann, and Whalley 2025), introducing an adaptive scheme, SA-SGLD, which employs time rescaling to modulate the stepsize according to a monitored quantity (typically the local gradient norm). SA-SGLD can automatically shrink stepsizes in regions of high curvature and expand them in flatter regions, improving both stability and mixing without introducing bias. We show that our method can achieve more accurate posterior sampling than SGLD on high-curvature 2D toy examples and in image classification with BNNs using sharp priors.
ML | Oct 14, 2024
Sampling from Bayesian Neural Network Posteriors with Symmetric Minibatch Splitting Langevin Dynamics
Daniel Paulin, Peter A. Whalley, Neil K. Chada et al.
We propose a scalable kinetic Langevin dynamics algorithm for sampling parameter spaces of big data and AI applications. Our scheme combines a symmetric forward/backward sweep over minibatches with a symmetric discretization of Langevin dynamics. For a particular Langevin splitting method (UBU), we show that the resulting Symmetric Minibatch Splitting-UBU (SMS-UBU) integrator has bias $O(h^2 d^{1/2})$ in dimension $d>0$ with stepsize $h>0$, despite only using one minibatch per iteration, thus providing excellent control of the sampling bias as a function of the stepsize. We apply the algorithm to explore local modes of the posterior distribution of Bayesian neural networks (BNNs) and evaluate the calibration performance of the posterior predictive probabilities for neural networks with convolutional neural network architectures for classification problems on three different datasets (Fashion-MNIST, Celeb-A and chest X-ray). Our results indicate that BNNs sampled with SMS-UBU can offer significantly better calibration performance compared to standard methods of training and stochastic weight averaging.
CO | Apr 26, 2025
A Langevin sampling algorithm inspired by the Adam optimizer
Benedict Leimkuhler, René Lohmann, Peter Whalley
We present a framework for adaptive-stepsize MCMC sampling based on time-rescaled Langevin dynamics, in which the stepsize variation is dynamically driven by an additional degree of freedom. Our approach augments the phase space by an additional variable which in turn defines a time reparameterization. The use of an auxiliary relaxation equation allows accumulation of a moving average of a local monitor function and provides for precise control of the timestep while circumventing the need to modify the drift term in the physical system. Our algorithm is straightforward to implement and can be readily combined with any off-the-peg fixed-stepsize Langevin integrator. As a particular example, we consider control of the stepsize by monitoring the norm of the log-posterior gradient, which takes inspiration from the Adam optimizer, the stepsize being automatically reduced in regions of steep change of the log posterior and increased on plateaus, improving numerical stability and convergence speed. As in Adam, the stepsize variation depends on the recent history of the gradient norm, which enhances stability and improves accuracy compared to more immediate control approaches. We demonstrate the potential benefit of this method, both in accuracy and in stability, in numerical experiments including Neal's funnel and a Bayesian neural network for classification of MNIST data.
LG | Sep 24, 2025
How deep is your network? Deep vs. shallow learning of transfer operators
Mohammad Tabish, Benedict Leimkuhler, Stefan Klus
We propose a randomized neural network approach called RaNNDy for learning transfer operators and their spectral decompositions from data. The weights of the hidden layers of the neural network are randomly selected and only the output layer is trained. The main advantage is that without a noticeable reduction in accuracy, this approach significantly reduces the training time and resources while avoiding common problems associated with deep learning such as sensitivity to hyperparameters and slow convergence. Additionally, the proposed framework allows us to compute a closed-form solution for the output layer which directly represents the eigenfunctions of the operator. Moreover, it is possible to estimate uncertainties associated with the computed spectral properties via ensemble learning. We present results for different dynamical operators, including Koopman and Perron-Frobenius operators, which have important applications in analyzing the behavior of complex dynamical systems, and the Schrödinger operator. The numerical examples, which highlight the strengths but also weaknesses of the proposed framework, include several stochastic dynamical systems, protein folding processes, and the quantum harmonic oscillator.
LG | Jun 20, 2021
Multirate Training of Neural Networks
Tiffany Vlaar, Benedict Leimkuhler
We propose multirate training of neural networks: partitioning neural network parameters into "fast" and "slow" parts which are trained on different time scales, where slow parts are updated less frequently. By choosing appropriate partitionings we can obtain substantial computational speed-up for transfer learning tasks. We show for applications in vision and NLP that we can fine-tune deep neural networks in almost half the time, without reducing the generalization performance of the resulting models. We analyze the convergence properties of our multirate scheme and draw a comparison with vanilla SGD. We also discuss splitting choices for the neural network parameters which could enhance generalization performance when neural networks are trained from scratch. A multirate approach can be used to learn different features present in the data and as a form of regularization. Our paper unlocks the potential of using multirate techniques for neural network training and provides several starting points for future work in this area.
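The fast/slow partitioning described in this abstract can be sketched on a toy problem. This is one possible reading only: the schedule below (fast parameters updated every step with stepsize `h`, slow parameters updated every `k` steps with stepsize `k * h`) is an assumption, as are the function names and the quadratic loss, and the paper's actual scheme may differ.

```python
def multirate_gd(grad_fast, grad_slow, w_fast, w_slow, h, k, n_steps):
    """Toy multirate gradient descent on two scalar parameters.

    The fast parameter is updated at every step. The slow parameter (and its
    gradient) is only touched every k-th step, with a stepsize scaled by k.
    Returns the final parameters and the number of slow-gradient evaluations.
    """
    slow_updates = 0
    for step in range(n_steps):
        w_fast = w_fast - h * grad_fast(w_fast, w_slow)
        if (step + 1) % k == 0:
            w_slow = w_slow - k * h * grad_slow(w_fast, w_slow)
            slow_updates += 1
    return w_fast, w_slow, slow_updates


def toy_loss(w_fast, w_slow):
    """Hypothetical loss with a stiff (fast) and a gentle (slow) direction."""
    return 0.5 * (10.0 * w_fast ** 2 + 0.1 * w_slow ** 2)


def toy_grad_fast(w_fast, w_slow):
    """Partial derivative of toy_loss with respect to the fast parameter."""
    return 10.0 * w_fast


def toy_grad_slow(w_fast, w_slow):
    """Partial derivative of toy_loss with respect to the slow parameter."""
    return 0.1 * w_slow
```

For example, `multirate_gd(toy_grad_fast, toy_grad_slow, 1.0, 1.0, 0.05, 5, 1000)` evaluates the slow gradient only 200 times instead of 1000 while still driving `toy_loss` toward zero. In a neural network the saving would come from skipping gradient computation for the slow part of the network on most steps.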
LG | Jun 20, 2021
Better Training using Weight-Constrained Stochastic Dynamics
Benedict Leimkuhler, Tiffany Vlaar, Timothée Pouchon et al.
We employ constraints to control the parameter space of deep neural networks throughout training. The use of customized, appropriately designed constraints can reduce the vanishing/exploding gradients problem, improve smoothness of classification boundaries, control weight magnitudes and stabilize deep neural networks, and thus enhance the robustness of training algorithms and the generalization capabilities of neural networks. We provide a general approach to efficiently incorporate constraints into a stochastic gradient Langevin framework, allowing enhanced exploration of the loss landscape. We also present specific examples of constrained training methods motivated by orthogonality preservation for weight matrices and explicit weight normalizations. Discretization schemes are provided both for the overdamped formulation of Langevin dynamics and the underdamped form, in which momenta further improve sampling efficiency. These optimization schemes can be used directly, without needing to adapt neural network architecture design choices or to modify the objective with regularization terms, and we observe performance improvements in classification tasks.
LG | Jun 17, 2020
Constraint-Based Regularization of Neural Networks
Benedict Leimkuhler, Timothée Pouchon, Tiffany Vlaar et al.
We propose a method for efficiently incorporating constraints into a stochastic gradient Langevin framework for the training of deep neural networks. Constraints allow direct control of the parameter space of the model. Appropriately designed, they reduce the vanishing/exploding gradient problem, control weight magnitudes and stabilize deep neural networks and thus improve the robustness of training algorithms and the generalization capabilities of the trained neural network. We present examples of constrained training methods motivated by orthogonality preservation for weight matrices and explicit weight normalizations. We describe the methods in the overdamped formulation of Langevin dynamics and the underdamped form, in which momenta help to improve sampling efficiency. The methods are explored in test examples in image classification and natural language processing.
LG | Aug 30, 2019
Partitioned integrators for thermodynamic parameterization of neural networks
Benedict Leimkuhler, Charles Matthews, Tiffany Vlaar
Traditionally, neural networks are parameterized using optimization procedures such as stochastic gradient descent, RMSProp and ADAM. These procedures tend to drive the parameters of the network toward a local minimum. In this article, we employ alternative "sampling" algorithms (referred to here as "thermodynamic parameterization methods") which rely on discretized stochastic differential equations for a defined target distribution on parameter space. We show that the thermodynamic perspective already improves neural network training. Moreover, by partitioning the parameters based on natural layer structure we obtain schemes with very rapid convergence for data sets with complicated loss landscapes. We describe easy-to-implement hybrid partitioned numerical algorithms, based on discretized stochastic differential equations, which are adapted to feed-forward neural networks, including a multi-layer Langevin algorithm, AdLaLa (combining the adaptive Langevin and Langevin algorithms) and LOL (combining Langevin and Overdamped Langevin); we examine the convergence of these methods using numerical studies and compare their performance among themselves and in relation to standard alternatives such as stochastic gradient descent and ADAM. We present evidence that thermodynamic parameterization methods can be (i) faster, (ii) more accurate, and (iii) more robust than standard algorithms used within machine learning frameworks.
ST | Mar 20, 2019
TATi-Thermodynamic Analytics ToolkIt: TensorFlow-based software for posterior sampling in machine learning applications
Frederik Heber, Zofia Trstanova, Benedict Leimkuhler
With the advent of GPU-assisted hardware and maturing high-efficiency software platforms such as TensorFlow and PyTorch, Bayesian posterior sampling for neural networks becomes plausible. In this article we discuss Bayesian parametrization in machine learning based on Markov Chain Monte Carlo methods, specifically discretized stochastic differential equations such as Langevin dynamics and extended system methods in which an ensemble of walkers is employed to enhance sampling. We provide a glimpse of the potential of the sampling-intensive approach by studying (and visualizing) the loss landscape of a neural network applied to the MNIST data set. Moreover, we investigate how the sampling efficiency itself can be significantly enhanced through an ensemble quasi-Newton preconditioning method. This article accompanies the release of a new TensorFlow software package, the Thermodynamic Analytics ToolkIt, which is used in the computational experiments.
ME | Jul 13, 2016
Ensemble preconditioning for Markov chain Monte Carlo simulation
Charles Matthews, Jonathan Weare, Benedict Leimkuhler
We describe parallel Markov chain Monte Carlo methods that propagate a collective ensemble of paths, with local covariance information calculated from neighboring replicas. The use of collective dynamics eliminates multiplicative noise and stabilizes the dynamics, thus providing a practical approach to difficult anisotropic sampling problems in high dimensions. Numerical experiments with model problems demonstrate that dramatic speedups, compared to various alternative schemes, are attainable.
ML | Oct 29, 2015
Covariance-Controlled Adaptive Langevin Thermostat for Large-Scale Bayesian Sampling
Xiaocheng Shang, Zhanxing Zhu, Benedict Leimkuhler et al.
Monte Carlo sampling for Bayesian posterior inference is a common approach used in machine learning. The Markov Chain Monte Carlo procedures that are used are often discrete-time analogues of associated stochastic differential equations (SDEs). These SDEs are guaranteed to leave invariant the required posterior distribution. An area of current research addresses the computational benefits of stochastic gradient methods in this setting. Existing techniques rely on estimating the variance or covariance of the subsampling error, and typically assume constant variance. In this article, we propose a covariance-controlled adaptive Langevin thermostat that can effectively dissipate parameter-dependent noise while maintaining a desired target distribution. The proposed method achieves a substantial speedup over popular alternative schemes for large-scale machine learning applications.
FLU-DYN | Nov 21, 2014
Least-biased correction of extended dynamical systems using observational data
Keith Myerscough, Jason Frank, Benedict Leimkuhler
We consider dynamical systems evolving near an equilibrium statistical state where the interest is in modelling long-term behavior that is consistent with thermodynamic constraints. We adjust the distribution using an entropy-optimizing formulation that can be computed on-the-fly, making possible partial corrections using incomplete information, for example measured data or data computed from a different model (or the same model at a different scale). We employ a thermostatting technique to sample the target distribution with the aim of capturing relevant statistical features while introducing mild dynamical perturbation (thermostats). The method is tested for a point vortex fluid model on the sphere, and we demonstrate both convergence of equilibrium quantities and the ability of the formulation to balance stationary and transient-regime errors.