NAApr 30, 2013
Complexity Analysis of Accelerated MCMC Methods for Bayesian InversionViet Ha Hoang, Christoph Schwab, Andrew M. Stuart
We study Bayesian inversion for a model elliptic PDE with unknown diffusion coefficient. We provide complexity analyses of several Markov Chain-Monte Carlo (MCMC) methods for the efficient numerical evaluation of expectations under the Bayesian posterior distribution, given data $δ$. Particular attention is given to bounds on the overall work required to achieve a prescribed error level $\varepsilon$. Specifically, we first bound the computational complexity of "plain" MCMC, based on combining MCMC sampling with linear complexity multilevel solvers for elliptic PDE. Our (new) work versus accuracy bounds show that the complexity of this approach can be quite prohibitive. Two strategies for reducing the computational complexity are then proposed and analyzed: first, a sparse, parametric and deterministic generalized polynomial chaos (gpc) "surrogate" representation of the forward response map of the PDE over the entire parameter space, and, second, a novel Multi-Level Markov Chain Monte Carlo (MLMCMC) strategy which utilizes sampling from a multilevel discretization of the posterior and of the forward PDE. For both of these strategies we derive asymptotic bounds on work versus accuracy, and hence asymptotic bounds on the computational complexity of the algorithms. In particular we provide sufficient conditions on the regularity of the unknown coefficients of the PDE, and on the approximation methods used, in order for the accelerations of MCMC resulting from these strategies to lead to complexity reductions over "plain" MCMC algorithms for Bayesian inversion of PDEs.}
NAMar 19, 2018
Parameterizations for Ensemble Kalman InversionNeil K. Chada, Marco A. Iglesias, Lassi Roininen et al.
The use of ensemble methods to solve inverse problems is attractive because it is a derivative-free methodology which is also well-adapted to parallelization. In its basic iterative form the method produces an ensemble of solutions which lie in the linear span of the initial ensemble. Choice of the parameterization of the unknown field is thus a key component of the success of the method. We demonstrate how both geometric ideas and hierarchical ideas can be used to design effective parameterizations for a number of applied inverse problems arising in electrical impedance tomography, groundwater flow and source inversion. In particular we show how geometric ideas, including the level set method, can be used to reconstruct piecewise continuous fields, and we show how hierarchical methods can be used to learn key parameters in continuous fields, such as length-scales, resulting in improved reconstructions. Geometric and hierarchical ideas are combined in the level set method to find piecewise constant reconstructions with interfaces of unknown topology.
NAAug 8, 2014
Algorithms for Kullback-Leibler Approximation of Probability Measures in Infinite DimensionsFrank J. Pinski, Gideon Simpson, Andrew M. Stuart et al.
In this paper we study algorithms to find a Gaussian approximation to a target measure defined on a Hilbert space of functions; the target measure itself is defined via its density with respect to a reference Gaussian measure. We employ the Kullback-Leibler divergence as a distance and find the best Gaussian approximation by minimizing this distance. It then follows that the approximate Gaussian must be equivalent to the Gaussian reference measure, defining a natural function space setting for the underlying calculus of variations problem. We introduce a computational algorithm which is well-adapted to the required minimization, seeking to find the mean as a function, and parameterizing the covariance in two different ways: through low rank perturbations of the reference covariance; and through Schrödinger potential perturbations of the inverse reference covariance. Two applications are shown: to a nonlinear inverse problem in elliptic PDEs, and to a conditioned diffusion process. We also show how the Gaussian approximations we obtain may be used to produce improved pCN-MCMC methods which are not only well-adapted to the high-dimensional setting, but also behave well with respect to small observational noise (resp. small temperatures) in the inverse problem (resp. conditioned diffusion).
NAApr 26, 2023
Nonlocality and Nonlinearity Implies Universality in Operator LearningSamuel Lanthaler, Zongyi Li, Andrew M. Stuart
Neural operator architectures approximate operators between infinite-dimensional Banach spaces of functions. They are gaining increased attention in computational science and engineering, due to their potential both to accelerate traditional numerical methods and to enable data-driven discovery. As the field is in its infancy basic questions about minimal requirements for universal approximation remain open. It is clear that any general approximation of operators between spaces of functions must be both nonlocal and nonlinear. In this paper we describe how these two attributes may be combined in a simple way to deduce universal approximation. In so doing we unify the analysis of a wide range of neural operator architectures and open up consideration of new ones. A popular variant of neural operators is the Fourier neural operator (FNO). Previous analysis proving universal operator approximation theorems for FNOs resorts to use of an unbounded number of Fourier modes, relying on intuition from traditional analysis of spectral methods. The present work challenges this point of view: (i) the work reduces FNO to its core essence, resulting in a minimal architecture termed the ``averaging neural operator'' (ANO); and (ii) analysis of the ANO shows that even this minimal ANO architecture benefits from universal approximation. This result is obtained based on only a spatial average as its only nonlocal ingredient (corresponding to retaining only a \emph{single} Fourier mode in the special case of the FNO). The analysis paves the way for a more systematic exploration of nonlocality, both through the development of new operator learning architectures and the analysis of existing and new architectures. Numerical results are presented which give insight into complexity issues related to the roles of channel width (embedding dimension) and number of Fourier modes.
LGJun 28, 2023
The Parametric Complexity of Operator LearningSamuel Lanthaler, Andrew M. Stuart
Neural operator architectures employ neural networks to approximate operators mapping between Banach spaces of functions; they may be used to accelerate model evaluations via emulation, or to discover models from data. Consequently, the methodology has received increasing attention over recent years, giving rise to the rapidly growing field of operator learning. The first contribution of this paper is to prove that for general classes of operators which are characterized only by their $C^r$- or Lipschitz-regularity, operator learning suffers from a "curse of parametric complexity", which is an infinite-dimensional analogue of the well-known curse of dimensionality encountered in high-dimensional approximation problems. The result is applicable to a wide variety of existing neural operators, including PCA-Net, DeepONet and the FNO.The second contribution of the paper is to prove that this general curse can be overcome for solution operators defined by the Hamilton-Jacobi equation; this is achieved by leveraging additional structure in the underlying solution operator, going beyond regularity. To this end, a novel neural operator architecture is introduced, termed HJ-Net, which explicitly takes into account characteristic information of the underlying Hamiltonian system. Error and complexity estimates are derived for HJ-Net which show that this architecture can provably beat the curse of parametric complexity related to the infinite-dimensional input and output function spaces.
PRJul 12, 2016
Weak error estimates for trajectories of SPDEs for Spectral Galerkin discretizationCharles-Edouard Bréhier, Martin Hairer, Andrew M. Stuart
We consider stochastic semi-linear evolution equations which are driven by additive, spatially correlated, Wiener noise, and in particular consider problems of heat equation (analytic semigroup) and damped-driven wave equations (bounded semigroup) type. We discretize these equations by means of a spectral Galerkin projection, and we study the approximation of the probability distribution of the trajectories: test functions are regular, but depend on the values of the process on the interval $[0,T]$. We introduce a new approach in the context of quantative weak error analysis for discretization of SPDEs. The weak error is formulated using a deterministic function (Itô map) of the stochastic convolution found when the nonlinear term is dropped. The regularity properties of the Itô map are exploited, and in particular second-order Taylor expansions employed, to transfer the error from spectral approximation of the stochastic convolution into the weak error of interest. We prove that the weak rate of convergence is twice the strong rate of convergence in two situations. First, we assume that the covariance operator commutes with the generator of the semigroup: the first order term in the weak error expansion cancels out thanks to an independence property. Second, we remove the commuting assumption, and extend the previous result, thanks to the analysis of a new error term depending on a commutator.
NAJun 21, 2023
Learning Homogenization for Elliptic OperatorsKaushik Bhattacharya, Nikola Kovachki, Aakila Rajan et al.
Multiscale partial differential equations (PDEs) arise in various applications, and several schemes have been developed to solve them efficiently. Homogenization theory is a powerful methodology that eliminates the small-scale dependence, resulting in simplified equations that are computationally tractable while accurately predicting the macroscopic response. In the field of continuum mechanics, homogenization is crucial for deriving constitutive laws that incorporate microscale physics in order to formulate balance laws for the macroscopic quantities of interest. However, obtaining homogenized constitutive laws is often challenging as they do not in general have an analytic form and can exhibit phenomena not present on the microscale. In response, data-driven learning of the constitutive law has been proposed as appropriate for this task. However, a major challenge in data-driven learning approaches for this problem has remained unexplored: the impact of discontinuities and corner interfaces in the underlying material. These discontinuities in the coefficients affect the smoothness of the solutions of the underlying equations. Given the prevalence of discontinuous materials in continuum mechanics applications, it is important to address the challenge of learning in this context; in particular, to develop underpinning theory that establishes the reliability of data-driven methods in this scientific domain. The paper addresses this unexplored challenge by investigating the learnability of homogenized constitutive laws for elliptic operators in the presence of such complexities. Approximation theory is presented, and numerical experiments are performed which validate the theory in the context of learning the solution operator defined by the cell problem arising in homogenization for elliptic PDEs.
DSAug 9, 2022
Second Order Ensemble Langevin Method for Sampling and Inverse ProblemsZiming Liu, Andrew M. Stuart, Yixuan Wang
We propose a sampling method based on an ensemble approximation of second order Langevin dynamics. The log target density is appended with a quadratic term in an auxiliary momentum variable and damped-driven Hamiltonian dynamics introduced; the resulting stochastic differential equation is invariant to the Gibbs measure, with marginal on the position coordinates given by the target. A preconditioner based on covariance under the law of the dynamics does not change this invariance property, and is introduced to accelerate convergence to the Gibbs measure. The resulting mean-field dynamics may be approximated by an ensemble method; this results in a gradient-free and affine-invariant stochastic dynamical system. Numerical results demonstrate its potential as the basis for a numerical sampler in Bayesian inverse problems.
46.9MLMay 5
Efficient Deconvolution in Populational Inverse ProblemsArnaud Vadeboncoeur, Mark Girolami, Andrew M. Stuart
This work is focussed on the inversion task of inferring the distribution over parameters of interest leading to multiple sets of observations. The potential to solve such distributional inversion problems is driven by increasing availability of data, but a major roadblock is blind deconvolution, arising when the observational noise distribution is unknown. However, when data originates from collections of physical systems, a population, it is possible to leverage this information to perform deconvolution. To this end, we propose a methodology leveraging large data sets of observations, collected from different instantiations of the same physical processes, to simultaneously deconvolve the data corrupting noise distribution, and to identify the distribution over model parameters defining the physical processes. A parameter-dependent mathematical model of the physical process is employed. A loss function characterizing the match between the observed data and the output of the mathematical model is defined; it is minimized as a function of the both the parameter inputs to the model of the physics and the parameterized observational noise. This coupled problem is addressed with a modified gradient descent algorithm that leverages specific structure in the noise model. Furthermore, a new active learning scheme is proposed, based on adaptive empirical measures, to train a surrogate model to be accurate in parameter regions of interest; this approach accelerates computation and enables automatic differentiation of black-box, potentially nondifferentiable, code computing parameter-to-solution maps. The proposed methodology is demonstrated on porous medium flow, damped elastodynamics, and simplified models of atmospheric dynamics.
GEO-PHOct 23, 2023
Modeling groundwater levels in California's Central Valley by hierarchical Gaussian process and neural network regressionAnshuman Pradhan, Kyra H. Adams, Venkat Chandrasekaran et al.
Modeling groundwater levels continuously across California's Central Valley (CV) hydrological system is challenging due to low-quality well data which is sparsely and noisily sampled across time and space. The lack of consistent well data makes it difficult to evaluate the impact of 2017 and 2019 wet years on CV groundwater following a severe drought during 2012-2015. A novel machine learning method is formulated for modeling groundwater levels by learning from a 3D lithological texture model of the CV aquifer. The proposed formulation performs multivariate regression by combining Gaussian processes (GP) and deep neural networks (DNN). The hierarchical modeling approach constitutes training the DNN to learn a lithologically informed latent space where non-parametric regression with GP is performed. We demonstrate the efficacy of GP-DNN regression for modeling non-stationary features in the well data with fast and reliable uncertainty quantification, as validated to be statistically consistent with the empirical data distribution from 90 blind wells across CV. We show how the model predictions may be used to supplement hydrological understanding of aquifer responses in basins with irregular well data. Our results indicate that on average the 2017 and 2019 wet years in California were largely ineffective in replenishing the groundwater loss caused during previous drought years.
98.6MLMar 20
Operator Learning for Smoothing and ForecastingEdoardo Calvello, Elizabeth Carlson, Nikola Kovachki et al.
Machine learning has opened new frontiers in purely data-driven algorithms for data assimilation in, and for forecasting of, dynamical systems; the resulting methods are showing some promise. However, in contrast to model-driven algorithms, analysis of these data-driven methods is poorly developed. In this paper we address this issue, developing a theory to underpin data-driven methods to solve smoothing problems arising in data assimilation and forecasting problems. The theoretical framework relies on two key components: (i) establishing the existence of the mapping to be learned; (ii) the properties of the operator learning architecture used to approximate this mapping. By studying these two components in conjunction, we establish the first universal approximation theorem for purely data-driven algorithms for both smoothing and forecasting of dynamical systems. We work in the continuous time setting, hence deploying neural operator architectures. The theoretical results are illustrated with experiments studying the Lorenz `63, Lorenz `96 and Kuramoto-Sivashinsky dynamical systems.
MLAug 2, 2024
Autoencoders in Function SpaceJustin Bunker, Mark Girolami, Hefin Lambley et al.
Autoencoders have found widespread application in both their original deterministic form and in their variational formulation (VAEs). In scientific applications and in image processing it is often of interest to consider data that are viewed as functions; while discretisation (of differential equations arising in the sciences) or pixellation (of images) renders problems finite dimensional in practice, conceiving first of algorithms that operate on functions, and only then discretising or pixellating, leads to better algorithms that smoothly operate between resolutions. In this paper function-space versions of the autoencoder (FAE) and variational autoencoder (FVAE) are introduced, analysed, and deployed. Well-definedness of the objective governing VAEs is a subtle issue, particularly in function space, limiting applicability. For the FVAE objective to be well defined requires compatibility of the data distribution with the chosen generative model; this can be achieved, for example, when the data arise from a stochastic differential equation, but is generally restrictive. The FAE objective, on the other hand, is well defined in many situations where FVAE fails to be. Pairing the FVAE and FAE objectives with neural operator architectures that can be evaluated on any mesh enables new applications of autoencoders to inpainting, superresolution, and generative modelling of scientific data.
LGAug 12, 2024
Operator Learning Using Random Features: A Tool for Scientific ComputingNicholas H. Nelsen, Andrew M. Stuart
Supervised operator learning centers on the use of training data, in the form of input-output pairs, to estimate maps between infinite-dimensional spaces. It is emerging as a powerful tool to complement traditional scientific computing, which may often be framed in terms of operators mapping between spaces of functions. Building on the classical random features methodology for scalar regression, this paper introduces the function-valued random features method. This leads to a supervised operator learning architecture that is practical for nonlinear problems yet is structured enough to facilitate efficient training through the optimization of a convex, quadratic cost. Due to the quadratic structure, the trained model is equipped with convergence guarantees and error and complexity bounds, properties that are not readily available for most other operator learning architectures. At its core, the proposed approach builds a linear combination of random operators. This turns out to be a low-rank approximation of an operator-valued kernel ridge regression algorithm, and hence the method also has strong connections to Gaussian process regression. The paper designs function-valued random features that are tailored to the structure of two nonlinear operator learning benchmark problems arising from parametric partial differential equations. Numerical results demonstrate the scalability, discretization invariance, and transferability of the function-valued random features method.
FLU-DYNFeb 26
Neural ensemble Kalman filter: Data assimilation for compressible flows with shocksXu-Hui Zhou, Lorenzo Beronilla, Michael K. Sleeman et al.
Data assimilation (DA) for compressible flows with shocks is challenging because many classical DA methods generate spurious oscillations and nonphysical features near uncertain shocks. We focus here on the ensemble Kalman filter (EnKF). We show that the poor performance of the standard EnKF may be attributed to the bimodal forecast distribution that can arise in the vicinity of an uncertain shock location; this violates the assumptions underpinning the EnKF, which assume a forecast which is close to Gaussian. To address this issue we introduce the new neural EnKF. The basic idea is to systematically embed neural function approximations within ensemble DA by mapping the forecast ensemble of shocked flows to the parameter space (weights and biases) of a deep neural network (NN) and to subsequently perform DA in that space. The nonlinear mapping encodes sharp and smooth flow features in an ensemble of NN parameters. Neural EnKF updates are therefore well-behaved only if the NN parameters vary smoothly within the neural representation of the forecast ensemble. We show that such a smooth variation of network parameters can be enforced via physics-informed transfer learning, and demonstrate that in so-doing the neural EnKF avoids the spurious oscillations and nonphysical features that plague the standard EnKF. The applicability of the neural EnKF is demonstrated through a series of systematic numerical experiments with an inviscid Burgers' equation, Sod's shock tube, and a two-dimensional blast wave.
75.2NAMay 14
Amortized Energy-Based Bayesian InferenceHojjat Kaveh, Ricardo Baptista, Andrew M. Stuart
We consider amortized Bayesian inference for nonlinear inverse problems in settings where only samples from the joint distribution of parameters and observations are available. Classical methods such as Markov chain Monte Carlo require solving a new inference problem for each observation, which can be computationally prohibitive when inference must be repeated many times. We propose a transport-based approach that learns an observation-dependent map pushing forward a reference measure to approximate the posterior distribution. The map is trained by minimizing an averaged energy-distance objective between the true posterior and the learned pushforward. This formulation is likelihood-free, requiring only joint samples, and avoids density evaluation, invertibility constraints, and Jacobian determinant computations. For function-space inverse problems with Gaussian priors, we parameterize the transport map as the identity plus a perturbation in the Cameron-Martin space of the prior, preserving absolute continuity with respect to the prior. In infinite-dimensional settings, the map is represented using neural operators. We illustrate the method on a finite-dimensional nonlinear inverse problem and two PDE-constrained inverse problems arising in porous medium flow and seismic inversion. The results show that the learned transport captures posterior structure, including multimodality and dominant modes, while enabling fast posterior sampling for new observations.
LGFeb 24, 2024
Operator Learning: Algorithms and AnalysisNikola B. Kovachki, Samuel Lanthaler, Andrew M. Stuart
Operator learning refers to the application of ideas from machine learning to approximate (typically nonlinear) operators mapping between Banach spaces of functions. Such operators often arise from physical models expressed in terms of partial differential equations (PDEs). In this context, such approximate operators hold great potential as efficient surrogate models to complement traditional numerical methods in many-query tasks. Being data-driven, they also enable model discovery when a mathematical description in terms of a PDE is not available. This review focuses primarily on neural operators, built on the success of deep neural networks in the approximation of functions defined on finite dimensional Euclidean spaces. Empirically, neural operators have shown success in a variety of applications, but our theoretical understanding remains incomplete. This review article summarizes recent progress and the current state of our theoretical understanding of neural operators, focusing on an approximation theoretic point of view.
LGJan 27, 2025
Memorization and Regularization in Generative Diffusion ModelsRicardo Baptista, Agnimitra Dasgupta, Nikola B. Kovachki et al.
Diffusion models have emerged as a powerful framework for generative modeling. At the heart of the methodology is score matching: learning gradients of families of log-densities for noisy versions of the data distribution at different scales. When the loss function adopted in score matching is evaluated using empirical data, rather than the population loss, the minimizer corresponds to the score of a time-dependent Gaussian mixture. However, use of this analytically tractable minimizer leads to data memorization: in both unconditioned and conditioned settings, the generative model returns the training samples. This paper contains an analysis of the dynamical mechanism underlying memorization. The analysis highlights the need for regularization to avoid reproducing the analytically tractable minimizer; and, in so doing, lays the foundations for a principled understanding of how to regularize. Numerical experiments investigate the properties of: (i) Tikhonov regularization; (ii) regularization designed to promote asymptotic consistency; and (iii) regularizations induced by under-parameterization of a neural network or by early stopping when training a neural network. These experiments are evaluated in the context of memorization, and directions for future development of regularization are highlighted.
NAMay 3, 2024
Discretization Error of Fourier Neural OperatorsSamuel Lanthaler, Andrew M. Stuart, Margaret Trautner
Operator learning is a variant of machine learning that is designed to approximate maps between function spaces from data. The Fourier Neural Operator (FNO) is one of the main model architectures used for operator learning. The FNO combines linear and nonlinear operations in physical space with linear operations in Fourier space, leading to a parameterized map acting between function spaces. Although in definition, FNOs are objects in continuous space and perform convolutions on a continuum, their implementation is a discretized object performing computations on a grid, allowing efficient implementation via the FFT. Thus, there is a discretization error between the continuum FNO definition and the discretized object used in practice that is separate from other previously analyzed sources of model error. We examine this discretization error here and obtain algebraic rates of convergence in terms of the grid resolution as a function of the input regularity. Numerical experiments that validate the theory and describe model stability are performed. In addition, an algorithm is presented that leverages the discretization error and model error decomposition to optimize computational training time.
NAJan 28, 2025
Solving Roughly Forced Nonlinear PDEs via Misspecified Kernel Methods and Neural NetworksRicardo Baptista, Edoardo Calvello, Matthieu Darcy et al.
We consider the use of Gaussian Processes (GPs) or Neural Networks (NNs) to numerically approximate the solutions to nonlinear partial differential equations (PDEs) with rough forcing or source terms, which commonly arise as pathwise solutions to stochastic PDEs. Kernel methods have recently been generalized to solve nonlinear PDEs by approximating their solutions as the maximum a posteriori estimator of GPs that are conditioned to satisfy the PDE at a finite set of collocation points. The convergence and error guarantees of these methods, however, rely on the PDE being defined in a classical sense and its solution possessing sufficient regularity to belong to the associated reproducing kernel Hilbert space. We propose a generalization of these methods to handle roughly forced nonlinear PDEs while preserving convergence guarantees with an oversmoothing GP kernel that is misspecified relative to the true solution's regularity. This is achieved by conditioning a regular GP to satisfy the PDE with a modified source term in a weak sense (when integrated against a finite number of test functions). This is equivalent to replacing the empirical $L^2$-loss on the PDE constraint by an empirical negative-Sobolev norm. We further show that this loss function can be used to extend physics-informed neural networks (PINNs) to stochastic equations, thereby resulting in a new NN-based variant termed Negative Sobolev Norm-PINN (NeS-PINN).
MLOct 7, 2025
Bilevel optimization for learning hyperparameters: Application to solving PDEs and inverse problems with Gaussian processesNicholas H. Nelsen, Houman Owhadi, Andrew M. Stuart et al.
Methods for solving scientific computing and inference problems, such as kernel- and neural network-based approaches for partial differential equations (PDEs), inverse problems, and supervised learning tasks, depend crucially on the choice of hyperparameters. Specifically, the efficacy of such methods, and in particular their accuracy, stability, and generalization properties, strongly depends on the choice of hyperparameters. While bilevel optimization offers a principled framework for hyperparameter tuning, its nested optimization structure can be computationally demanding, especially in PDE-constrained contexts. In this paper, we propose an efficient strategy for hyperparameter optimization within the bilevel framework by employing a Gauss-Newton linearization of the inner optimization step. Our approach provides closed-form updates, eliminating the need for repeated costly PDE solves. As a result, each iteration of the outer loop reduces to a single linearized PDE solve, followed by explicit gradient-based hyperparameter updates. We demonstrate the effectiveness of the proposed method through Gaussian process models applied to nonlinear PDEs and to PDE inverse problems. Extensive numerical experiments highlight substantial improvements in accuracy and robustness compared to conventional random hyperparameter initialization. In particular, experiments with additive kernels and neural network-parameterized deep kernels demonstrate the method's scalability and effectiveness for high-dimensional hyperparameter optimization.
MLMay 30, 2025
A Mathematical Perspective On Contrastive LearningRicardo Baptista, Andrew M. Stuart, Son Tran
Multimodal contrastive learning is a methodology for linking different data modalities; the canonical example is linking image and text data. The methodology is typically framed as the identification of a set of encoders, one for each modality, that align representations within a common latent space. In this work, we focus on the bimodal setting and interpret contrastive learning as the optimization of (parameterized) encoders that define conditional probability distributions, for each modality conditioned on the other, consistent with the available data. This provides a framework for multimodal algorithms such as crossmodal retrieval, which identifies the mode of one of these conditional distributions, and crossmodal classification, which is similar to retrieval but includes a fine-tuning step to make it task specific. The framework we adopt also gives rise to crossmodal generative models. This probabilistic perspective suggests two natural generalizations of contrastive learning: the introduction of novel probabilistic loss functions, and the use of alternative metrics for measuring alignment in the common latent space. We study these generalizations of the classical approach in the multivariate Gaussian setting. In this context we view the latent space identification as a low-rank matrix approximation problem. This allows us to characterize the capabilities of loss functions and alignment metrics to approximate natural statistics, such as conditional means and covariances; doing so yields novel variants on contrastive learning algorithms for specific mode-seeking and for generative tasks. The framework we introduce is also studied through numerical experiments on multivariate Gaussians, the labeled MNIST dataset, and on a data assimilation application arising in oceanography.
LGJun 25, 2024
Efficient, Multimodal, and Derivative-Free Bayesian Inference With Fisher-Rao Gradient FlowsYifan Chen, Daniel Zhengyu Huang, Jiaoyang Huang et al.
In this paper, we study efficient approximate sampling for probability distributions known up to normalization constants. We specifically focus on a problem class arising in Bayesian inference for large-scale inverse problems in science and engineering applications. The computational challenges we address with the proposed methodology are: (i) the need for repeated evaluations of expensive forward models; (ii) the potential existence of multiple modes; and (iii) the fact that gradient of, or adjoint solver for, the forward model might not be feasible. While existing Bayesian inference methods meet some of these challenges individually, we propose a framework that tackles all three systematically. Our approach builds upon the Fisher-Rao gradient flow in probability space, yielding a dynamical system for probability densities that converges towards the target distribution at a uniform exponential rate. This rapid convergence is advantageous for the computational burden outlined in (i). We apply Gaussian mixture approximations with operator splitting techniques to simulate the flow numerically; the resulting approximation can capture multiple modes thus addressing (ii). Furthermore, we employ the Kalman methodology to facilitate a derivative-free update of these Gaussian components and their respective weights, addressing the issue in (iii). The proposed methodology results in an efficient derivative-free sampler flexible enough to handle multi-modal distributions: Gaussian Mixture Kalman Inversion (GMKI). The effectiveness of GMKI is demonstrated both theoretically and numerically in several experiments with multimodal target distributions, including proof-of-concept and two-dimensional examples, as well as a large-scale application: recovering the Navier-Stokes initial condition from solution data at positive times.
LGJun 10, 2024
Continuum Attention for Neural OperatorsEdoardo Calvello, Nikola B. Kovachki, Matthew E. Levine et al.
Transformers, and the attention mechanism in particular, have become ubiquitous in machine learning. Their success in modeling nonlocal, long-range correlations has led to their widespread adoption in natural language processing, computer vision, and time series problems. Neural operators, which map spaces of functions into spaces of functions, are necessarily both nonlinear and nonlocal if they are universal; it is thus natural to ask whether the attention mechanism can be used in the design of neural operators. Motivated by this, we study transformers in the function space setting. We formulate attention as a map between infinite dimensional function spaces and prove that the attention mechanism as implemented in practice is a Monte Carlo or finite difference approximation of this operator. The function space formulation allows for the design of transformer neural operators, a class of architectures designed to learn mappings between function spaces. In this paper, we state and prove the first universal approximation result for transformer neural operators, using only a slight modification of the architecture implemented in practice. The prohibitive cost of applying the attention operator to functions defined on multi-dimensional domains leads to the need for more efficient attention-based architectures. For this reason we also introduce a function space generalization of the patching strategy from computer vision, and introduce a class of associated neural operators. Numerical results, on an array of operator learning problems, demonstrate the promise of our approaches to function space formulations of attention and their use in neural operators.
STAug 27, 2021
Convergence Rates for Learning Linear Operators from Noisy DataMaarten V. de Hoop, Nikola B. Kovachki, Nicholas H. Nelsen et al.
This paper studies the learning of linear operators between infinite-dimensional Hilbert spaces. The training data comprises pairs of random input vectors in a Hilbert space and their noisy images under an unknown self-adjoint linear operator. Assuming that the operator is diagonalizable in a known basis, this work solves the equivalent inverse problem of estimating the operator's eigenvalues given the data. Adopting a Bayesian approach, the theoretical analysis establishes posterior contraction rates in the infinite data limit with Gaussian priors that are not directly linked to the forward map of the inverse problem. The main results also include learning-theoretic generalization error guarantees for a wide range of distribution shifts. These convergence rates quantify the effects of data smoothness and true eigenvalue decay or growth, for compact or unbounded operators, respectively, on sample complexity. Numerical evidence supports the theory in diagonal and non-diagonal settings.
DSJul 14, 2021
A Framework for Machine Learning of Model Error in Dynamical SystemsMatthew E. Levine, Andrew M. Stuart
The development of data-informed predictive models for dynamical systems is of widespread interest in many disciplines. We present a unifying framework for blending mechanistic and machine-learning approaches to identify dynamical systems from noisily and partially observed data. We compare pure data-driven learning with hybrid models which incorporate imperfect domain knowledge. Our formulation is agnostic to the chosen machine learning model, is presented in both continuous- and discrete-time settings, and is compatible both with model errors that exhibit substantial memory and errors that are memoryless. First, we study memoryless linear (w.r.t. parametric-dependence) model error from a learning theory perspective, defining excess risk and generalization error. For ergodic continuous-time systems, we prove that both excess risk and generalization error are bounded above by terms that diminish with the square-root of T, the time-interval over which training data is specified. Secondly, we study scenarios that benefit from modeling with memory, proving universal approximation theorems for two classes of continuous-time recurrent neural networks (RNNs): both can learn memory-dependent model error. In addition, we connect one class of RNNs to reservoir computing, thereby relating learning of memory-dependent error to recent work on supervised learning between Banach spaces using random features. Numerical results are presented (Lorenz '63, Lorenz '96 Multiscale systems) to compare purely data-driven and hybrid approaches, finding hybrid methods less data-hungry and more parametrically efficient. Finally, we demonstrate numerically how data assimilation can be leveraged to learn hidden dynamics from noisy, partially-observed data, and illustrate challenges in representing memory by this approach, and in the training of such models.
NAApr 7, 2021
Ensemble Inference Methods for Models With Noisy and Expensive LikelihoodsOliver R. A. Dunbar, Andrew B. Duncan, Andrew M. Stuart et al.
The increasing availability of data presents an opportunity to calibrate unknown parameters which appear in complex models of phenomena in the biomedical, physical and social sciences. However, model complexity often leads to parameter-to-data maps which are expensive to evaluate and are only available through noisy approximations. This paper is concerned with the use of interacting particle systems for the solution of the resulting inverse problems for parameters. Of particular interest is the case where the available forward model evaluations are subject to rapid fluctuations, in parameter space, superimposed on the smoothly varying large scale parametric structure of interest. {A motivating example from climate science is presented, and ensemble Kalman methods (which do not use the derivative of the parameter-to-data map) are shown, empirically, to perform well. Multiscale analysis is then used to analyze the behaviour of interacting particle system algorithms when rapid fluctuations, which we refer to as noise, pollute the large scale parametric dependence of the parameter-to-data map. Ensemble Kalman methods and Langevin-based methods} (the latter use the derivative of the parameter-to-data map) are compared in this light. The ensemble Kalman methods are shown to behave favourably in the presence of noise in the parameter-to-data map, whereas Langevin methods are adversely affected. On the other hand, Langevin methods have the correct equilibrium distribution in the setting of noise-free forward models, whilst ensemble Kalman methods only provide an uncontrolled approximation, except in the linear case. Therefore a new class of algorithms, ensemble Gaussian process samplers, which combine the benefits of both ensemble Kalman and Langevin methods, are introduced and shown to perform favourably.
MLJul 25, 2020
Posterior Consistency of Semi-Supervised Regression on GraphsAndrea L. Bertozzi, Bamdad Hosseini, Hao Li et al.
Graph-based semi-supervised regression (SSR) is the problem of estimating the value of a function on a weighted graph from its values (labels) on a small subset of the vertices. This paper is concerned with the consistency of SSR in the context of classification, in the setting where the labels have small noise and the underlying graph weighting is consistent with well-clustered nodes. We present a Bayesian formulation of SSR in which the weighted graph defines a Gaussian prior, using a graph Laplacian, and the labeled data defines a likelihood. We analyze the rate of contraction of the posterior measure around the ground truth in terms of parameters that quantify the small label error and inherent clustering in the graph. We obtain bounds on the rates of contraction and illustrate their sharpness through numerical experiments. The analysis also gives insight into the choice of hyperparameters that enter the definition of the prior.
STMay 22, 2020
Consistency of Empirical Bayes And Kernel Flow For Hierarchical Parameter EstimationYifan Chen, Houman Owhadi, Andrew M. Stuart
Gaussian process regression has proven very powerful in statistics, machine learning and inverse problems. A crucial aspect of the success of this methodology, in a wide range of applications to complex and real-world problems, is hierarchical modeling and learning of hyperparameters. The purpose of this paper is to study two paradigms of learning hierarchical parameters: one is from the probabilistic Bayesian perspective, in particular, the empirical Bayes approach that has been largely used in Bayesian statistics; the other is from the deterministic and approximation theoretic view, and in particular the kernel flow algorithm that was proposed recently in the machine learning literature. Analysis of their consistency in the large data limit, as well as explicit identification of their implicit bias in parameter learning, are established in this paper for a Matérn-like model on the torus. A particular technical challenge we overcome is the learning of the regularity parameter in the Matérn-like field, for which consistency results have been very scarce in the spatial statistics literature. Moreover, we conduct extensive numerical experiments beyond the Matérn-like model, comparing the two algorithms further. These experiments demonstrate learning of other hierarchical parameters, such as amplitude and lengthscale; they also illustrate the setting of model misspecification in which the kernel flow approach could show superior performance to the more traditional empirical Bayes approach.
NAMay 20, 2020
The Random Feature Model for Input-Output Maps between Banach SpacesNicholas H. Nelsen, Andrew M. Stuart
Well known to the machine learning community, the random feature model is a parametric approximation to kernel interpolation or regression methods. It is typically used to approximate functions mapping a finite-dimensional input space to the real line. In this paper, we instead propose a methodology for use of the random feature model as a data-driven surrogate for operators that map an input Banach space to an output Banach space. Although the methodology is quite general, we consider operators defined by partial differential equations (PDEs); here, the inputs and outputs are themselves functions, with the input parameters being functions required to specify the problem, such as initial data or coefficients, and the outputs being solutions of the problem. Upon discretization, the model inherits several desirable attributes from this infinite-dimensional viewpoint, including mesh-invariant approximation error with respect to the true PDE solution map and the capability to be trained at one mesh resolution and then deployed at different mesh resolutions. We view the random feature model as a non-intrusive data-driven emulator, provide a mathematical framework for its interpretation, and demonstrate its ability to efficiently and accurately approximate the nonlinear parameter-to-solution maps of two prototypical PDEs arising in physical science and engineering applications: viscous Burgers' equation and a variable coefficient elliptic equation.
NAMay 7, 2020
Model Reduction and Neural Networks for Parametric PDEsKaushik Bhattacharya, Bamdad Hosseini, Nikola B. Kovachki et al.
We develop a general framework for data-driven approximation of input-output maps between infinite-dimensional spaces. The proposed approach is motivated by the recent successes of neural networks and deep learning, in combination with ideas from model reduction. This combination results in a neural network approximation which, in principle, is defined on infinite-dimensional spaces and, in practice, is robust to the dimension of finite-dimensional approximations of these spaces required for computation. For a class of input-output maps, and suitably chosen probability measures on the inputs, we prove convergence of the proposed approximation methodology. We also include numerical experiments which demonstrate the effectiveness of the method, showing convergence and robustness of the approximation scheme with respect to the size of the discretization, and compare it with existing algorithms from the literature; our examples include the mapping from coefficient to solution in a divergence form elliptic partial differential equation (PDE) problem, and the solution operator for viscous Burgers' equation.
SPSep 13, 2019
Spectral Analysis Of Weighted Laplacians Arising In Data ClusteringFranca Hoffmann, Bamdad Hosseini, Assad A. Oberai et al.
Graph Laplacians computed from weighted adjacency matrices are widely used to identify geometric structure in data, and clusters in particular; their spectral properties play a central role in a number of unsupervised and semi-supervised learning algorithms. When suitably scaled, graph Laplacians approach limiting continuum operators in the large data limit. Studying these limiting operators, therefore, sheds light on learning algorithms. This paper is devoted to the study of a parameterized family of divergence form elliptic operators that arise as the large data limit of graph Laplacians. The link between a three-parameter family of graph Laplacians and a three-parameter family of differential operators is explained. The spectral properties of these differential operators are analyzed in the situation where the data comprises two nearly separated clusters, in a sense which is made precise. In particular, we investigate how the spectral gap depends on the three parameters entering the graph Laplacian, and on a parameter measuring the size of the perturbation from the perfectly clustered case. Numerical results are presented which exemplify and extend the analysis: the computations study situations in which there are two nearly separated clusters, but which violate the assumptions used in our theory; situations in which more than two clusters are present, also going beyond our theory; and situations which demonstrate the relevance of our studies of differential operators for the understanding of finite data problems via the graph Laplacian. The findings provide insight into parameter choices made in learning algorithms which are based on weighted adjacency matrices; they also provide the basis for analysis of the consistency of various unsupervised and semi-supervised learning algorithms, in the large data limit.
MLJun 18, 2019
Consistency of semi-supervised learning algorithms on graphs: Probit and one-hot methodsFranca Hoffmann, Bamdad Hosseini, Zhi Ren et al.
Graph-based semi-supervised learning is the problem of propagating labels from a small number of labelled data points to a larger set of unlabelled data. This paper is concerned with the consistency of optimization-based techniques for such problems, in the limit where the labels have small noise and the underlying unlabelled data is well clustered. We study graph-based probit for binary classification, and a natural generalization of this method to multi-class classification using one-hot encoding. The resulting objective function to be optimized comprises the sum of a quadratic form defined through a rational function of the graph Laplacian, involving only the unlabelled data, and a fidelity term involving only the labelled data. The consistency analysis sheds light on the choice of the rational function defining the optimization.
LGJun 10, 2019
Continuous Time Analysis of Momentum MethodsNikola B. Kovachki, Andrew M. Stuart
Gradient descent-based optimization methods underpin the parameter training of neural networks, and hence comprise a significant component in the impressive test results found in a number of applications. Introducing stochasticity is key to their success in practical problems, and there is some understanding of the role of stochastic gradient descent in this context. Momentum modifications of gradient descent such as Polyak's Heavy Ball method (HB) and Nesterov's method of accelerated gradients (NAG), are also widely adopted. In this work our focus is on understanding the role of momentum in the training of neural networks, concentrating on the common situation in which the momentum contribution is fixed at each step of the algorithm. To expose the ideas simply we work in the deterministic setting. Our approach is to derive continuous time approximations of the discrete algorithms; these continuous time approximations provide insights into the mechanisms at play within the discrete algorithms. We prove three such approximations. Firstly we show that standard implementations of fixed momentum methods approximate a time-rescaled gradient descent flow, asymptotically as the learning rate shrinks to zero; this result does not distinguish momentum methods from pure gradient descent, in the limit of vanishing learning rate. We then proceed to prove two results aimed at understanding the observed practical advantages of fixed momentum methods over gradient descent. We achieve this by proving approximations to continuous time limits in which the small but fixed learning rate appears as a parameter. Furthermore in a third result we show that the momentum methods admit an exponentially attractive invariant manifold on which the dynamics reduces, approximately, to a gradient flow with respect to a modified loss function.
STMay 10, 2019
Hyperparameter Estimation in Bayesian MAP Estimation: Parameterizations and ConsistencyMatthew M. Dunlop, Tapio Helin, Andrew M. Stuart
The Bayesian formulation of inverse problems is attractive for three primary reasons: it provides a clear modelling framework; means for uncertainty quantification; and it allows for principled learning of hyperparameters. The posterior distribution may be explored by sampling methods, but for many problems it is computationally infeasible to do so. In this situation maximum a posteriori (MAP) estimators are often sought. Whilst these are relatively cheap to compute, and have an attractive variational formulation, a key drawback is their lack of invariance under change of parameterization. This is a particularly significant issue when hierarchical priors are employed to learn hyperparameters. In this paper we study the effect of the choice of parameterization on MAP estimators when a conditionally Gaussian hierarchical prior distribution is employed. Specifically we consider the centred parameterization, the natural parameterization in which the unknown state is solved for directly, and the noncentred parameterization, which works with a whitened Gaussian as the unknown state variable, and arises when considering dimension-robust MCMC algorithms; MAP estimation is well-defined in the nonparametric setting only for the noncentred parameterization. However, we show that MAP estimates based on the noncentred parameterization are not consistent as estimators of hyperparameters; conversely, we show that limits of finite-dimensional centred MAP estimators are consistent as the dimension tends to infinity. We also consider empirical Bayesian hyperparameter estimation, show consistency of these estimates, and demonstrate that they are more robust with respect to noise than centred MAP estimates. An underpinning concept throughout is that hyperparameters may only be recovered up to measure equivalence, a well-known phenomenon in the context of the Ornstein-Uhlenbeck process.
LGAug 10, 2018
Ensemble Kalman Inversion: A Derivative-Free Technique For Machine Learning TasksNikola B. Kovachki, Andrew M. Stuart
The standard probabilistic perspective on machine learning gives rise to empirical risk-minimization tasks that are frequently solved by stochastic gradient descent (SGD) and variants thereof. We present a formulation of these tasks as classical inverse or filtering problems and, furthermore, we propose an efficient, gradient-free algorithm for finding a solution to these problems using ensemble Kalman inversion (EKI). Applications of our approach include offline and online supervised learning with deep neural networks, as well as graph-based semi-supervised learning. The essence of the EKI procedure is an ensemble based approximate gradient descent in which derivatives are replaced by differences from within the ensemble. We suggest several modifications to the basic method, derived from empirically successful heuristics developed in the context of SGD. Numerical results demonstrate wide applicability and robustness of the proposed algorithm.
MLMay 23, 2018
Large Data and Zero Noise Limits of Graph-Based Semi-Supervised Learning AlgorithmsMatthew M. Dunlop, Dejan Slepčev, Andrew M. Stuart et al.
Scalings in which the graph Laplacian approaches a differential operator in the large graph limit are used to develop understanding of a number of algorithms for semi-supervised learning; in particular the extension, to this graph setting, of the probit algorithm, level set and kriging methods, are studied. Both optimization and Bayesian approaches are considered, based around a regularizing quadratic form found from an affine transformation of the Laplacian, raised to a, possibly fractional, exponent. Conditions on the parameters defining this quadratic form are identified under which well-defined limiting continuum analogues of the optimization and Bayesian semi-supervised learning problems may be found, thereby shedding light on the design of algorithms in the large graph setting. The large graph limits of the optimization formulations are tackled through $Γ-$convergence, using the recently introduced $TL^p$ metric. The small labelling noise limits of the Bayesian formulations are also identified, and contrasted with pre-existing harmonic function approaches to the problem.
MEMar 9, 2018
Dimension-Robust MCMC in Bayesian Inverse ProblemsVictor Chen, Matthew M. Dunlop, Omiros Papaspiliopoulos et al.
The methodology developed in this article is motivated by a wide range of prediction and uncertainty quantification problems that arise in Statistics, Machine Learning and Applied Mathematics, such as non-parametric regression, multi-class classification and inversion of partial differential equations. One popular formulation of such problems is as Bayesian inverse problems, where a prior distribution is used to regularize inference on a high-dimensional latent state, typically a function or a field. It is common that such priors are non-Gaussian, for example piecewise-constant or heavy-tailed, and/or hierarchical, in the sense of involving a further set of low-dimensional parameters, which, for example, control the scale or smoothness of the latent state. In this formulation prediction and uncertainty quantification relies on efficient exploration of the posterior distribution of latent states and parameters. This article introduces a framework for efficient MCMC sampling in Bayesian inverse problems that capitalizes upon two fundamental ideas in MCMC, non-centred parameterisations of hierarchical models and dimension-robust samplers for latent Gaussian processes. Using a range of diverse applications we showcase that the proposed framework is dimension-robust, that is, the efficiency of the MCMC sampling does not deteriorate as the dimension of the latent state gets higher. We showcase the full potential of the machinery we develop in the article in semi-supervised multi-class classification, where our sampling algorithm is used within an active learning framework to guide the selection of input data to manually label in order to achieve high predictive accuracy with a minimal number of labelled data.
LGMar 26, 2017
Uncertainty quantification in graph-based classification of high dimensional dataAndrea L. Bertozzi, Xiyang Luo, Andrew M. Stuart et al.
Classification of high dimensional data finds wide-ranging applications. In many of these applications equipping the resulting classification with a measure of uncertainty may be as important as the classification itself. In this paper we introduce, develop algorithms for, and investigate the properties of, a variety of Bayesian models for the task of binary classification; via the posterior distribution on the classification labels, these methods automatically give measures of uncertainty. The methods are all based around the graph formulation of semi-supervised learning. We provide a unified framework which brings together a variety of methods which have been introduced in different communities within the mathematical sciences. We study probit classification in the graph-based setting, generalize the level-set method for Bayesian inverse problems to the classification setting, and generalize the Ginzburg-Landau optimization-based classifier to a Bayesian setting; we also show that the probit and level set approaches are natural relaxations of the harmonic function approach introduced in [Zhu et al 2003]. We introduce efficient numerical methods, suited to large data-sets, for both MCMC-based sampling as well as gradient-based MAP estimation. Through numerical experiments we study classification accuracy and uncertainty quantification for our models; these experiments showcase a suite of datasets commonly used to evaluate graph-based semi-supervised learning algorithms.
NASep 20, 2016
Analysis of the ensemble Kalman filter for inverse problemsClaudia Schillings, Andrew M. Stuart
The ensemble Kalman filter (EnKF) is a widely used methodology for state estimation in partial, noisily observed dynamical systems, and for parameter estimation in inverse problems. Despite its widespread use in the geophysical sciences, and its gradual adoption in many other areas of application, analysis of the method is in its infancy. Furthermore, much of the existing analysis deals with the large ensemble limit, far from the regime in which the method is typically used. The goal of this paper is to analyze the method when applied to inverse problems with fixed ensemble size. A continuous-time limit is derived and the long-time behavior of the resulting dynamical system is studied. Most of the rigorous analysis is confined to the linear forward problem, where we demonstrate that the continuous time limit of the EnKF corresponds to a set of gradient flows for the data misfit in each ensemble member, coupled through a common pre-conditioner which is the empirical covariance matrix of the ensemble. Numerical results demonstrate that the conclusions of the analysis extend beyond the linear inverse problem setting. Numerical experiments are also given which demonstrate the benefits of various extensions of the basic methodology.
PRSep 9, 2016
The Bayesian Formulation of EIT: Analysis and AlgorithmsMatthew M. Dunlop, Andrew M. Stuart
We provide a rigorous Bayesian formulation of the EIT problem in an infinite dimensional setting, leading to well-posedness in the Hellinger metric with respect to the data. We focus particularly on the reconstruction of binary fields where the interface between different media is the primary unknown. We consider three different prior models - log-Gaussian, star-shaped and level set. Numerical simulations based on the implementation of MCMC are performed, illustrating the advantages and disadvantages of each type of prior in the reconstruction, in the case where the true conductivity is a binary field, and exhibiting the properties of the resulting posterior distribution.
NASep 9, 2016
MAP Estimators for Piecewise Continuous InversionMatthew M. Dunlop, Andrew M. Stuart
We study the inverse problem of estimating a field $u$ from data comprising a finite set of nonlinear functionals of $u$, subject to additive noise; we denote this observed data by $y$. Our interest is in the reconstruction of piecewise continuous fields in which the discontinuity set is described by a finite number of geometric parameters. Natural applications include groundwater flow and electrical impedance tomography. We take a Bayesian approach, placing a prior distribution on $u$ and determining the conditional distribution on $u$ given the data $y$. It is then natural to study maximum a posterior (MAP) estimators. Recently (Dashti et al 2013) it has been shown that MAP estimators can be characterised as minimisers of a generalised Onsager-Machlup functional, in the case where the prior measure is a Gaussian random field. We extend this theory to a more general class of prior distributions which allows for piecewise continuous fields. Specifically, the prior field is assumed to be piecewise Gaussian with random interfaces between the different Gaussians defined by a finite number of parameters. We also make connections with recent work on MAP estimators for linear problems and possibly non-Gaussian priors (Helin, Burger 2015) which employs the notion of Fomin derivative. In showing applicability of our theory we focus on the groundwater flow and EIT models, though the theory holds more generally. Numerical experiments are implemented for the groundwater flow model, demonstrating the feasibility of determining MAP estimators for these piecewise continuous models, but also that the geometric formulation can lead to multiple nearby (local) MAP estimators. We relate these MAP estimators to the behaviour of output from MCMC samples of the posterior, obtained using a state-of-the-art function space Metropolis-Hastings method.
PRFeb 28, 2010
Convergence of Numerical Time-Averaging and Stationary Measures via Poisson EquationsJonathan C. Mattingly, Andrew M. Stuart, M. V. Tretyakov
Numerical approximation of the long time behavior of a stochastic differential equation (SDE) is considered. Error estimates for time-averaging estimators are obtained and then used to show that the stationary behavior of the numerical method converges to that of the SDE. The error analysis is based on using an associated Poisson equation for the underlying SDE. The main advantage of this approach is its simplicity and universality. It works equally well for a range of explicit and implicit schemes including those with simple simulation of random variables, and for hypoelliptic SDEs. To simplify the exposition, we consider only the case where the state space of the SDE is a torus and we study only smooth test functions. However we anticipate that the approach can be applied more widely. An analogy between our approach and Stein's method is indicated. Some practical implications of the results are discussed.
PRJan 25, 2010
Optimal tuning of the Hybrid Monte-Carlo AlgorithmAlexandros Beskos, Natesh S. Pillai, Gareth O. Roberts et al.
We investigate the properties of the Hybrid Monte-Carlo algorithm (HMC) in high dimensions. HMC develops a Markov chain reversible w.r.t. a given target distribution $Π$ by using separable Hamiltonian dynamics with potential $-\logΠ$. The additional momentum variables are chosen at random from the Boltzmann distribution and the continuous-time Hamiltonian dynamics are then discretised using the leapfrog scheme. The induced bias is removed via a Metropolis-Hastings accept/reject rule. In the simplified scenario of independent, identically distributed components, we prove that, to obtain an $\mathcal{O}(1)$ acceptance probability as the dimension $d$ of the state space tends to $\infty$, the leapfrog step-size $h$ should be scaled as $h= l \times d^{-1/4}$. Therefore, in high dimensions, HMC requires $\mathcal{O}(d^{1/4})$ steps to traverse the state space. We also identify analytically the asymptotically optimal acceptance probability, which turns out to be 0.651 (to three decimal places). This is the choice which optimally balances the cost of generating a proposal, which {\em decreases} as $l$ increases, against the cost related to the average number of proposals required to obtain acceptance, which {\em increases} as $l$ increases.