DCMar 10Code
Accelerating High-Order Finite Element Simulations at Extreme Scale with FP64 Tensor CoresJiqun Tu, Ian Karlin, John Camier et al.
Finite element simulations play a critical role in a wide range of applications, from automotive design to tsunami modeling and computational electromagnetics. Performing these simulations efficiently at the high resolutions needed for practical applications and scientific insights necessitates the use of high-order methods and large-scale supercomputing. While much progress has been made in porting finite element codes to GPU systems in recent years, additional improvements in the efficiency and computational speed of GPU-accelerated high-order finite element simulations are in constant demand. In this paper, we demonstrate that the FP64 tensor cores on NVIDIA GPUs can be used to further accelerate such simulations, achieving significant speedups in key kernels of MFEM, a scalable open-source finite element library widely used in HPC applications. By integrating FP64 tensor cores with kernel fusion optimizations, we were able to achieve up to 2$\times$ performance gains and up to 83% energy efficiency gains on NVIDIA's Grace Hopper GH200 and Grace Blackwell GB200 architectures. To the best of our knowledge, this is the first time that FP64 tensor cores have been directly programmed to accelerate large-scale finite element scientific computing applications. We demonstrate the performance of the optimized kernels at exascale by showing near-perfect weak scaling efficiency and 90% strong scaling efficiency across nearly 10,000 GPUs on the Alps system. The new algorithms and MFEM enhancements directly benefit complex production codes, including the 2025 Gordon Bell Prize-winning application for real-time tsunami forecasting.
NAJun 21, 2022
Derivative-Informed Neural Operator: An Efficient Framework for High-Dimensional Parametric Derivative LearningThomas O'Leary-Roseberry, Peng Chen, Umberto Villa et al.
We propose derivative-informed neural operators (DINOs), a general family of neural networks to approximate operators as infinite-dimensional mappings from input function spaces to output function spaces or quantities of interest. After discretizations both inputs and outputs are high-dimensional. We aim to approximate not only the operators with improved accuracy but also their derivatives (Jacobians) with respect to the input function-valued parameter to empower derivative-based algorithms in many applications, e.g., Bayesian inverse problems, optimization under parameter uncertainty, and optimal experimental design. The major difficulties include the computational cost of generating derivative training data and the high dimensionality of the problem leading to large training cost. To address these challenges, we exploit the intrinsic low-dimensionality of the derivatives and develop algorithms for compressing derivative information and efficiently imposing it in neural operator training yielding derivative-informed neural operators. We demonstrate that these advances can significantly reduce the costs of both data generation and training for large classes of problems (e.g., nonlinear steady state parametric PDE maps), making the costs marginal or comparable to the costs without using derivatives, and in particular independent of the discretization dimension of the input and output functions. Moreover, we show that the proposed DINO achieves significantly higher accuracy than neural operators trained without derivative information, for both function approximation and derivative approximation (e.g., Gauss-Newton Hessian), especially when the training data are limited.
NAOct 6, 2022
Residual-based error correction for neural operator accelerated infinite-dimensional Bayesian inverse problemsLianghao Cao, Thomas O'Leary-Roseberry, Prashant K. Jha et al.
We explore using neural operators, or neural network representations of nonlinear maps between function spaces, to accelerate infinite-dimensional Bayesian inverse problems (BIPs) with models governed by nonlinear parametric partial differential equations (PDEs). Neural operators have gained significant attention in recent years for their ability to approximate the parameter-to-solution maps defined by PDEs using as training data solutions of PDEs at a limited number of parameter samples. The computational cost of BIPs can be drastically reduced if the large number of PDE solves required for posterior characterization are replaced with evaluations of trained neural operators. However, reducing error in the resulting BIP solutions via reducing the approximation error of the neural operators in training can be challenging and unreliable. We provide an a priori error bound result that implies certain BIPs can be ill-conditioned to the approximation error of neural operators, thus leading to inaccessible accuracy requirements in training. To reliably deploy neural operators in BIPs, we consider a strategy for enhancing the performance of neural operators, which is to correct the prediction of a trained neural operator by solving a linear variational problem based on the PDE residual. We show that a trained neural operator with error correction can achieve a quadratic reduction of its approximation error, all while retaining substantial computational speedups of posterior sampling when models are governed by highly nonlinear PDEs. The strategy is applied to two numerical examples of BIPs based on a nonlinear reaction--diffusion problem and deformation of hyperelastic materials. We demonstrate that posterior representations of the two BIPs produced using trained neural operators are greatly and consistently enhanced by error correction.
NAAug 16, 2018
A comparative study of structural similarity and regularization for joint inverse problems governed by PDEsBenjamin Crestel, Georg Stadler, Omar Ghattas
Joint inversion refers to the simultaneous inference of multiple parameter fields from observations of systems governed by single or multiple forward models. In many cases these parameter fields reflect different attributes of a single medium and are thus spatially correlated or structurally similar. By imposing prior information on their spatial correlations via a joint regularization term, we seek to improve the reconstruction of the parameter fields relative to inversion for each field independently. One of the main challenges is to devise a joint regularization functional that conveys the spatial correlations or structural similarity between the fields while at the same time permitting scalable and efficient solvers for the joint inverse problem. We describe several joint regularizations that are motivated by these goals: a cross-gradient and a normalized cross-gradient structural similarity term, the vectorial total variation, and a joint regularization based on the nuclear norm of the gradients. Based on numerical results from three classes of inverse problems with piecewise-homogeneous parameter fields, we conclude that the vectorial total variation functional is preferable to the other methods considered. Besides resulting in good reconstructions in all experiments, it allows for scalable, efficient solvers for joint inverse problems governed by PDE forward models.
NAFeb 5, 2019
Scalable matrix-free adaptive product-convolution approximation for locally translation-invariant operatorsNick Alger, Vishwas Rao, Aaron Myers et al.
We present an adaptive grid matrix-free operator approximation scheme based on a "product-convolution" interpolation of convolution operators. This scheme is appropriate for operators that are locally translation-invariant, even if these operators are high-rank or full-rank. Such operators arise in Schur complement methods for solving partial differential equations (PDEs), as Hessians in PDE-constrained optimization and inverse problems, as integral operators, as covariance operators, and as Dirichlet-to-Neumann maps. Constructing the approximation requires computing the impulse responses of the operator to point sources centered on nodes in an adaptively refined grid of sample points. A randomized a-posteriori error estimator drives the adaptivity. Once constructed, the approximation can be efficiently applied to vectors using the fast Fourier transform. The approximation can be efficiently converted to hierarchical matrix ($H$-matrix) format, then inverted or factorized using scalable $H$-matrix arithmetic. The quality of the approximation degrades gracefully as fewer sample points are used, allowing cheap lower quality approximations to be used as preconditioners. This yields an automated method to construct preconditioners for locally translation-invariant Schur complements. We directly address issues related to boundaries and prove that our scheme eliminates boundary artifacts. We test the scheme on a spatially varying blurring kernel, on the non-local component of an interface Schur complement for the Poisson operator, and on the data misfit Hessian for an advection dominated advection-diffusion inverse problem. Numerical results show that the scheme outperforms existing methods.
MLApr 19, 2022
A stochastic Stein Variational Newton methodAlex Leviyev, Joshua Chen, Yifei Wang et al.
Stein variational gradient descent (SVGD) is a general-purpose optimization-based sampling algorithm that has recently exploded in popularity, but is limited by two issues: it is known to produce biased samples, and it can be slow to converge on complicated distributions. A recently proposed stochastic variant of SVGD (sSVGD) addresses the first issue, producing unbiased samples by incorporating a special noise into the SVGD dynamics such that asymptotic convergence is guaranteed. Meanwhile, Stein variational Newton (SVN), a Newton-like extension of SVGD, dramatically accelerates the convergence of SVGD by incorporating Hessian information into the dynamics, but also produces biased samples. In this paper we derive, and provide a practical implementation of, a stochastic variant of SVN (sSVN) which is both asymptotically correct and converges rapidly. We demonstrate the effectiveness of our algorithm on a difficult class of test problems -- the Hybrid Rosenbrock density -- and show that sSVN converges using three orders of magnitude fewer gradient evaluations of the log likelihood than its stochastic SVGD counterpart. Our results show that sSVN is a promising approach to accelerating high-precision Bayesian inference tasks with modest-dimension, $d\sim\mathcal{O}(10)$.
NAMar 13, 2019
Sparse polynomial approximation for optimal control problems constrained by elliptic PDEs with lognormal random coefficientsPeng Chen, Omar Ghattas
In this work, we consider optimal control problems constrained by elliptic partial differential equations (PDEs) with lognormal random coefficients, which are represented by a countably infinite-dimensional random parameter with i.i.d. normal distribution. We approximate the optimal solution by a suitable truncation of its Hermite polynomial chaos expansion, which is known as a sparse polynomial approximation. Based on the convergence analysis in \cite{BachmayrCohenDeVoreEtAl2017} for elliptic PDEs with lognormal random coefficients, we establish the dimension-independent convergence rate of the sparse polynomial approximation of the optimal solution. Moreover, we present a polynomial-based sparse quadrature for the approximation of the expectation of the optimal solution and prove its dimension-independent convergence rate based on the analysis in \cite{Chen2018}. Numerical experiments demonstrate that the convergence of the sparse quadrature error is independent of the active parameter dimensions and can be much faster than that of a Monte Carlo method.
OCMar 3
Shape Derivative-Informed Neural Operators with Application to Risk-Averse Shape OptimizationXindi Gong, Dingcheng Luo, Thomas O'Leary-Roseberry et al.
Shape optimization under uncertainty (OUU) is computationally intensive for classical PDE-based methods due to the high cost of repeated sampling-based risk evaluation across many uncertainty realizations and varying geometries, while standard neural surrogates often fail to provide accurate and efficient sensitivities for optimization. We introduce Shape-DINO, a derivative-informed neural operator framework for learning PDE solution operators on families of varying geometries, with a particular focus on accelerating PDE-constrained shape OUU. Shape-DINOs encode geometric variability through diffeomorphic mappings to a fixed reference domain and employ a derivative-informed operator learning objective that jointly learns the PDE solution and its Fréchet derivatives with respect to design variables and uncertain parameters, enabling accurate state predictions and reliable gradients for large-scale OUU. We establish a priori error bounds linking surrogate accuracy to optimization error and prove universal approximation results for multi-input reduced basis neural operators in suitable $C^1$ norms. We demonstrate efficiency and scalability on three representative shape OUU problems, including boundary design for a Poisson equation and shape design governed by steady-state Navier-Stokes exterior flows in two and three dimensions. Across these examples, Shape-DINOs produce more reliable optimization results than operator surrogates trained without derivative information. In our examples, Shape-DINOs achieve 3-8 orders-of-magnitude speedups in state and gradient evaluations. Counting training data generation, Shape-DINOs reduce necessary PDE solves by 1-2 orders-of-magnitude compared to a strictly PDE-based approach for a single OUU problem. Moreover, Shape-DINO construction costs can be amortized across many objectives and risk measures, enabling large-scale shape OUU for complex systems.
NAMar 22
Tucker Tensor Train Taylor SeriesNick Alger, Blake Christierson, Peng Chen et al.
We present methods for constructing Taylor series surrogate models for covariance preconditioned high dimensional mappings that depend implicitly on the solution of a system of nonlinear equations, e.g., the solution of a partial differential equation. Taylor series are traditionally considered intractable for such mappings because the derivative tensors are enormous, and are only accessible through ``probing'' (contraction of the tensor with vectors in all but one index). We overcome these challenges using a ``Tucker tensor train Taylor series'' (T4S) surrogate model, in which each derivative tensor is approximated by a Tucker decomposition composed with a tensor train. After an initial dimension reduction, Tucker tensor trains are fit to directionally symmetric tensor probes using Riemannian manifold optimization within a rank continuation scheme. The optimization is enabled by fast sweeping methods for applying the Riemannian Jacobian (the Jacobian for the Tucker tensor train fitting problem) and its transpose to vectors. We justify the T4S model theoretically, and provide numerical evidence for the effectiveness of the proposed methods.
GEO-PHMar 16
Real-time probabilistic tsunami forecasting in Cascadia from sparse offshore pressure observationsStefan Henneking, Fabian Kutschera, Sreeram Venkat et al.
Near-field tsunami early warning in the Cascadia Subduction Zone is limited by sparse offshore observations. We show that a hypothetical network of 175 seafloor pressure sensors can support real-time Bayesian inference of tsunamigenic seafloor motion and probabilistic tsunami forecasts for two fully-coupled Cascadia earthquake dynamic rupture--tsunami scenarios, a partial rupture and a margin-wide rupture. The complex oceanic acoustic, Rayleigh, and tsunami wavefields in both scenarios are similar during the first two minutes and then diverge. Using an acoustic--gravity inversion with offline precomputation and online assimilation of pressure data, tsunami forecasts are obtained in less than a second. We leverage a Bayesian inversion-based framework that splits the computations into an offline precomputation phase performed with large-scale computing facilities, and an online phase that computes forecasts from real-time data and can be executed on a laptop. Forecast errors remain low at 22.1% for the margin-wide rupture and 19.6% for the partial rupture.
LGDec 16, 2025
Derivative-Informed Fourier Neural Operator: Universal Approximation and Applications to PDE-Constrained OptimizationBoyuan Yao, Dingcheng Luo, Lianghao Cao et al.
We present approximation theories and efficient training methods for derivative-informed Fourier neural operators (DIFNOs) with applications to PDE-constrained optimization. A DIFNO is an FNO trained by minimizing its prediction error jointly on output and Fréchet derivative samples of a high-fidelity operator (e.g., a parametric PDE solution operator). As a result, a DIFNO can closely emulate not only the high-fidelity operator's response but also its sensitivities. To motivate the use of DIFNOs instead of conventional FNOs as surrogate models, we show that accurate surrogate-driven PDE-constrained optimization requires accurate surrogate Fréchet derivatives. Then, for continuously differentiable operators, we establish (i) simultaneous universal approximation of FNOs and their Fréchet derivatives on compact sets, and (ii) universal approximation of FNOs in weighted Sobolev spaces with input measures that have unbounded supports. Our theoretical results certify the capability of FNOs for accurate derivative-informed operator learning and accurate solution of PDE-constrained optimization. Furthermore, we develop efficient training schemes using dimension reduction and multi-resolution techniques that significantly reduce memory and computational costs for Fréchet derivative learning. Numerical examples on nonlinear diffusion--reaction, Helmholtz, and Navier--Stokes equations demonstrate that DIFNOs are superior in sample complexity for operator learning and solving infinite-dimensional PDE-constrained inverse problems, achieving high accuracy at low training sample sizes.
DCApr 9
Sensor Placement for Tsunami Early Warning via Large-Scale Bayesian Optimal Experimental DesignSreeram Venkat, Stefan Henneking, Omar Ghattas
Real-time tsunami early warning relies on distributed sensor networks to infer seismic sources and seafloor motion. Optimizing these networks via Bayesian optimal experimental design (OED) is exceptionally challenging for systems governed by hyperbolic partial differential equations, which lack the spectral decay required by standard low-rank approximations. We present a scalable Bayesian OED framework for linear time-invariant systems. By reformulating the inverse problem in the data space, we transform OED into dense matrix subset selection. We propose a multi-GPU, Schur-complement-update-based, greedy algorithm that solves the OED problem using a pipelined approach that fully overlaps I/O with GPU computations. Our framework achieves near-perfect weak and strong scaling across hundreds of GPUs on Perlmutter and Frontier. Applied to the 2025 Gordon Bell Prize-winning digital twin for tsunami forecasting in the Cascadia Subduction Zone, we optimize a 175-sensor network, minimizing the uncertainty of a parameter field with over one billion degrees of freedom.
NAMar 13, 2024
Derivative-informed neural operator acceleration of geometric MCMC for infinite-dimensional Bayesian inverse problemsLianghao Cao, Thomas O'Leary-Roseberry, Omar Ghattas
We propose an operator learning approach to accelerate geometric Markov chain Monte Carlo (MCMC) for solving infinite-dimensional Bayesian inverse problems (BIPs). While geometric MCMC employs high-quality proposals that adapt to posterior local geometry, it requires repeated computations of gradients and Hessians of the log-likelihood, which becomes prohibitive when the parameter-to-observable (PtO) map is defined through expensive-to-solve parametric partial differential equations (PDEs). We consider a delayed-acceptance geometric MCMC method driven by a neural operator surrogate of the PtO map, where the proposal exploits fast surrogate predictions of the log-likelihood and, simultaneously, its gradient and Hessian. To achieve a substantial speedup, the surrogate must accurately approximate the PtO map and its Jacobian, which often demands a prohibitively large number of PtO map samples via conventional operator learning methods. In this work, we present an extension of derivative-informed operator learning [O'Leary-Roseberry et al., J. Comput. Phys., 496 (2024)] that uses joint samples of the PtO map and its Jacobian. This leads to derivative-informed neural operator (DINO) surrogates that accurately predict the observables and posterior local geometry at a significantly lower training cost than conventional methods. Cost and error analysis for reduced basis DINO surrogates are provided. Numerical studies demonstrate that DINO-driven MCMC generates effective posterior samples 3--9 times faster than geometric MCMC and 60--97 times faster than prior geometry-based MCMC. Furthermore, the training cost of DINO surrogates breaks even compared to geometric MCMC after just 10--25 effective posterior samples.
NANov 19, 2024
LazyDINO: Fast, scalable, and efficiently amortized Bayesian inversion via structure-exploiting and surrogate-driven measure transportLianghao Cao, Joshua Chen, Michael Brennan et al.
We present LazyDINO, a transport map variational inference method for fast, scalable, and efficiently amortized solutions of high-dimensional nonlinear Bayesian inverse problems with expensive parameter-to-observable (PtO) maps. Our method consists of an offline phase in which we construct a derivative-informed neural surrogate of the PtO map using joint samples of the PtO map and its Jacobian. During the online phase, when given observational data, we seek rapid posterior approximation using surrogate-driven training of a lazy map [Brennan et al., NeurIPS, (2020)], i.e., a structure-exploiting transport map with low-dimensional nonlinearity. The trained lazy map then produces approximate posterior samples or density evaluations. Our surrogate construction is optimized for amortized Bayesian inversion using lazy map variational inference. We show that (i) the derivative-based reduced basis architecture [O'Leary-Roseberry et al., Comput. Methods Appl. Mech. Eng., 388 (2022)] minimizes the upper bound on the expected error in surrogate posterior approximation, and (ii) the derivative-informed training formulation [O'Leary-Roseberry et al., J. Comput. Phys., 496 (2024)] minimizes the expected error due to surrogate-driven transport map optimization. Our numerical results demonstrate that LazyDINO is highly efficient in cost amortization for Bayesian inversion. We observe one to two orders of magnitude reduction of offline cost for accurate posterior approximation, compared to simulation-based amortized inference via conditional transport and conventional surrogate-driven transport. In particular, LazyDINO outperforms Laplace approximation consistently using fewer than 1000 offline samples, while other amortized inference methods struggle and sometimes fail at 16,000 offline samples.
NAApr 11, 2025
Dimension reduction for derivative-informed operator learning: An analysis of approximation errorsDingcheng Luo, Thomas O'Leary-Roseberry, Peng Chen et al.
We study the derivative-informed learning of nonlinear operators between infinite-dimensional separable Hilbert spaces by neural networks. Such operators can arise from the solution of partial differential equations (PDEs), and are used in many simulation-based outer-loop tasks in science and engineering, such as PDE-constrained optimization, Bayesian inverse problems, and optimal experimental design. In these settings, the neural network approximations can be used as surrogate models to accelerate the solution of the outer-loop tasks. However, since outer-loop tasks in infinite dimensions often require knowledge of the underlying geometry, the approximation accuracy of the operator's derivatives can also significantly impact the performance of the surrogate model. Motivated by this, we analyze the approximation errors of neural operators in Sobolev norms over infinite-dimensional Gaussian input measures. We focus on the reduced basis neural operator (RBNO), which uses linear encoders and decoders defined on dominant input/output subspaces spanned by reduced sets of orthonormal bases. To this end, we study two methods for generating the bases; principal component analysis (PCA) and derivative-informed subspaces (DIS), which use the dominant eigenvectors of the covariance of the data or the derivatives as the reduced bases, respectively. We then derive bounds for errors arising from both the dimension reduction and the latent neural network approximation, including the sampling errors associated with the empirical estimation of the PCA/DIS. Our analysis is validated on numerical experiments with elliptic PDEs, where our results show that bases informed by the map (i.e., DIS or output PCA) yield accurate reconstructions and generalization errors for both the operator and its derivatives, while input PCA may underperform unless ranks and training sample sizes are sufficiently large.
OCMay 31, 2023
Efficient PDE-Constrained optimization under high-dimensional uncertainty using derivative-informed neural operatorsDingcheng Luo, Thomas O'Leary-Roseberry, Peng Chen et al.
We propose a novel machine learning framework for solving optimization problems governed by large-scale partial differential equations (PDEs) with high-dimensional random parameters. Such optimization under uncertainty (OUU) problems may be computational prohibitive using classical methods, particularly when a large number of samples is needed to evaluate risk measures at every iteration of an optimization algorithm, where each sample requires the solution of an expensive-to-solve PDE. To address this challenge, we propose a new neural operator approximation of the PDE solution operator that has the combined merits of (1) accurate approximation of not only the map from the joint inputs of random parameters and optimization variables to the PDE state, but also its derivative with respect to the optimization variables, (2) efficient construction of the neural network using reduced basis architectures that are scalable to high-dimensional OUU problems, and (3) requiring only a limited number of training data to achieve high accuracy for both the PDE solution and the OUU solution. We refer to such neural operators as multi-input reduced basis derivative informed neural operators (MR-DINOs). We demonstrate the accuracy and efficiency our approach through several numerical experiments, i.e. the risk-averse control of a semilinear elliptic PDE and the steady state Navier--Stokes equations in two and three spatial dimensions, each involving random field inputs. Across the examples, MR-DINOs offer $10^{3}$--$10^{7} \times$ reductions in execution time, and are able to produce OUU solutions of comparable accuracies to those from standard PDE based solutions while being over $10 \times$ more cost-efficient after factoring in the cost of construction.
LGDec 14, 2021
Learning High-Dimensional Parametric Maps via Reduced Basis Adaptive Residual NetworksThomas O'Leary-Roseberry, Xiaosong Du, Anirban Chaudhuri et al.
We propose a scalable framework for the learning of high-dimensional parametric maps via adaptively constructed residual network (ResNet) maps between reduced bases of the inputs and outputs. When just few training data are available, it is beneficial to have a compact parametrization in order to ameliorate the ill-posedness of the neural network training problem. By linearly restricting high-dimensional maps to informed reduced bases of the inputs, one can compress high-dimensional maps in a constructive way that can be used to detect appropriate basis ranks, equipped with rigorous error estimates. A scalable neural network learning framework is thus to learn the nonlinear compressed reduced basis mapping. Unlike the reduced basis construction, however, neural network constructions are not guaranteed to reduce errors by adding representation power, making it difficult to achieve good practical performance. Inspired by recent approximation theory that connects ResNets to sequential minimizing flows, we present an adaptive ResNet construction algorithm. This algorithm allows for depth-wise enrichment of the neural network approximation, in a manner that can achieve good practical performance by first training a shallow network and then adapting. We prove universal approximation of the associated neural network class for $L^2_ν$ functions on compact sets. Our overall framework allows for constructive means to detect appropriate breadth and depth, and related compact parametrizations of neural networks, significantly reducing the need for architectural hyperparameter tuning. Numerical experiments for parametric PDE problems and a 3D CFD wing design optimization parametric map demonstrate that the proposed methodology can achieve remarkably high accuracy for limited training data, and outperformed other neural network strategies we compared against.
NANov 30, 2020
Derivative-Informed Projected Neural Networks for High-Dimensional Parametric Maps Governed by PDEsThomas O'Leary-Roseberry, Umberto Villa, Peng Chen et al.
Many-query problems, arising from uncertainty quantification, Bayesian inversion, Bayesian optimal experimental design, and optimization under uncertainty-require numerous evaluations of a parameter-to-output map. These evaluations become prohibitive if this parametric map is high-dimensional and involves expensive solution of partial differential equations (PDEs). To tackle this challenge, we propose to construct surrogates for high-dimensional PDE-governed parametric maps in the form of projected neural networks that parsimoniously capture the geometry and intrinsic low-dimensionality of these maps. Specifically, we compute Jacobians of these PDE-based maps, and project the high-dimensional parameters onto a low-dimensional derivative-informed active subspace; we also project the possibly high-dimensional outputs onto their principal subspace. This exploits the fact that many high-dimensional PDE-governed parametric maps can be well-approximated in low-dimensional parameter and output subspace. We use the projection basis vectors in the active subspace as well as the principal output subspace to construct the weights for the first and last layers of the neural network, respectively. This frees us to train the weights in only the low-dimensional layers of the neural network. The architecture of the resulting neural network captures to first order, the low-dimensional structure and geometry of the parametric map. We demonstrate that the proposed projected neural network achieves greater generalization accuracy than a full neural network, especially in the limited training data regime afforded by expensive PDE-based parametric maps. Moreover, we show that the number of degrees of freedom of the inner layers of the projected network is independent of the parameter and output dimensions, and high accuracy can be achieved with weight dimension independent of the discretization dimension.
LGFeb 9, 2020
Projected Stein Variational Gradient DescentPeng Chen, Omar Ghattas
The curse of dimensionality is a longstanding challenge in Bayesian inference in high dimensions. In this work, we propose a projected Stein variational gradient descent (pSVGD) method to overcome this challenge by exploiting the fundamental property of intrinsic low dimensionality of the data informed subspace stemming from ill-posedness of such problems. We adaptively construct the subspace using a gradient information matrix of the log-likelihood, and apply pSVGD to the much lower-dimensional coefficients of the parameter projection. The method is demonstrated to be more accurate and efficient than SVGD. It is also shown to be more scalable with respect to the number of parameters, samples, data points, and processor cores via experiments with parameters dimensions ranging from the hundreds to the tens of thousands.
OCFeb 7, 2020
Ill-Posedness and Optimization Geometry for Nonlinear Neural Network TrainingThomas O'Leary-Roseberry, Omar Ghattas
In this work we analyze the role nonlinear activation functions play at stationary points of dense neural network training problems. We consider a generic least squares loss function training formulation. We show that the nonlinear activation functions used in the network construction play a critical role in classifying stationary points of the loss landscape. We show that for shallow dense networks, the nonlinear activation function determines the Hessian nullspace in the vicinity of global minima (if they exist), and therefore determines the ill-posedness of the training problem. Furthermore, for shallow nonlinear networks we show that the zeros of the activation function and its derivatives can lead to spurious local minima, and discuss conditions for strict saddle points. We extend these results to deep dense neural networks, showing that the last activation function plays an important role in classifying stationary points, due to how it shows up in the gradient from the chain rule.
OCFeb 7, 2020
Low Rank Saddle Free Newton: A Scalable Method for Stochastic Nonconvex OptimizationThomas O'Leary-Roseberry, Nick Alger, Omar Ghattas
In modern deep learning, highly subsampled stochastic approximation (SA) methods are preferred to sample average approximation (SAA) methods because of large data sets as well as generalization properties. Additionally, due to perceived costs of forming and factorizing Hessians, second order methods are not used for these problems. In this work we motivate the extension of Newton methods to the SA regime, and argue for the use of the scalable low rank saddle free Newton (LRSFN) method, which avoids forming the Hessian in favor of making a low rank approximation. Additionally, LRSFN can facilitate fast escape from indefinite regions leading to better optimization solutions. In the SA setting, iterative updates are dominated by stochastic noise, and stability of the method is key. We introduce a continuous time stability analysis framework, and use it to demonstrate that stochastic errors for Newton methods can be greatly amplified by ill-conditioned Hessians. The LRSFN method mitigates this stability issue via Levenberg-Marquardt damping. However, generally the analysis shows that second order methods with stochastic Hessian and gradient information may need to take small steps, unlike in deterministic problems. Numerical results show that LRSFN can escape indefinite regions that other methods have issues with; and even under restrictive step length conditions, LRSFN can outperform popular first order methods on large scale deep learning tasks in terms of generalizability for equivalent computational work.
NASep 26, 2018
Sparse polynomial approximations for affine parametric saddle point problemsPeng Chen, Omar Ghattas
In this work we study convergence properties of sparse polynomial approximations for a class of affine parametric saddle point problems. Such problems can be found in many computational science and engineering fields, including the Stokes equations for viscous incompressible flow, mixed formulation of diffusion equations for heat conduction or groundwater flow, time-harmonic Maxwell equations for electromagnetics, etc. Due to the lack of knowledge or intrinsic randomness, the coefficients of such problems are uncertain and can often be represented or approximated by high- or countably infinite-dimensional random parameters equipped with suitable probability distributions, and the coefficients affinely depend on a series of either globally or locally supported basis functions, e.g., Karhunen--Loève expansion, piecewise polynomials, or adaptive wavelet approximations. Consequently, we are faced with solving affine parametric saddle point problems. Here we study sparse polynomial approximations of the parametric solutions, in particular sparse Taylor approximations, and their convergence properties for these parametric problems. With suitable sparsity assumptions on the parametrization, we obtain the algebraic convergence rates $O(N^{-r})$ for the sparse polynomial approximations of the parametric solutions, in cases of both globally and locally supported basis functions. We prove that $r$ depends only on a sparsity parameter in the parametrization of the random input, and in particular does not depend on the number of active parameter dimensions or the number of polynomial terms $N$. These results imply that sparse polynomial approximations can effectively break the curse of dimensionality, thereby establishing a theoretical foundation for the development and application of such practical algorithms as adaptive, least-squares, and compressive sensing constructions.
NASep 26, 2018
Hessian-based sampling for high-dimensional model reductionPeng Chen, Omar Ghattas
In this work we develop a Hessian-based sampling method for the construction of goal-oriented reduced order models with high-dimensional parameter inputs. Model reduction is known very challenging for high-dimensional parametric problems whose solutions also live in high-dimensional manifolds. However, the manifold of some quantity of interest (QoI) depending on the parametric solutions may be low-dimensional. We use the Hessian of the QoI with respect to the parameter to detect this low-dimensionality, and draw training samples by projecting the high-dimensional parameter to a low-dimensional subspace spanned by the eigenvectors of the Hessian corresponding to its dominating eigenvalues. Instead of forming the full Hessian, which is computationally intractable for a high-dimensional parameter, we employ a randomized algorithm to efficiently compute the dominating eigenpairs of the Hessian whose cost does not depend on the nominal dimension of the parameter but only on the intrinsic dimension of the QoI. We demonstrate that the Hessian-based sampling leads to much smaller errors of the reduced basis approximation for the QoI compared to a random sampling for a diffusion equation with random input obeying either uniform or Gaussian distributions.
NAAug 2, 2017
A data scalable augmented Lagrangian KKT preconditioner for large scale inverse problemsNick Alger, Umberto Villa, Tan Bui-Thanh et al.
Current state of the art preconditioners for the reduced Hessian and the Karush-Kuhn-Tucker (KKT) operator for large scale inverse problems are typically based on approximating the reduced Hessian with the regularization operator. However, the quality of this approximation degrades with increasingly informative observations or data. Thus the best case scenario from a scientific standpoint (fully informative data) is the worse case scenario from a computational perspective. In this paper we present an augmented Lagrangian-type preconditioner based on a block diagonal approximation of the augmented upper left block of the KKT operator. The preconditioner requires solvers for two linear subproblems that arise in the augmented KKT operator, which we expect to be much easier to precondition than the reduced Hessian. Analysis of the spectrum of the preconditioned KKT operator indicates that the preconditioner is effective when the regularization is chosen appropriately. In particular, it is effective when the regularization does not over-penalize highly informed parameter modes and does not under-penalize uninformed modes. Finally, we present a numerical study for a large data/low noise Poisson source inversion problem, demonstrating the effectiveness of the preconditioner. In this example, three MINRES iterations on the KKT system with our preconditioner results in a reconstruction with better accuracy than 50 iterations of CG on the reduced Hessian system with regularization preconditioning.
NAJun 20, 2017
Hessian-based adaptive sparse quadrature for infinite-dimensional Bayesian inverse problemsPeng Chen, Umberto Villa, Omar Ghattas
In this work we propose and analyze a Hessian-based adaptive sparse quadrature to compute infinite-dimensional integrals with respect to the posterior distribution in the context of Bayesian inverse problems with Gaussian prior. Due to the concentration of the posterior distribution in the domain of the prior distribution, a prior-based parametrization and sparse quadrature may fail to capture the posterior distribution and lead to erroneous evaluation results. By using a parametrization based on the Hessian of the negative log-posterior, the adaptive sparse quadrature can effectively allocate the quadrature points according to the posterior distribution. A dimension-independent convergence rate of the proposed method is established under certain assumptions on the Gaussian prior and the integrands. Dimension-independent and faster convergence than $O(N^{-1/2})$ is demonstrated for a linear as well as a nonlinear inverse problem whose posterior distribution can be effectively approximated by a Gaussian distribution at the MAP point.