NA · Sep 12, 2012
Strong convergence of an explicit numerical method for SDEs with nonglobally Lipschitz continuous coefficients
Martin Hutzenthaler, Arnulf Jentzen, Peter E. Kloeden
On the one hand, the explicit Euler scheme fails to converge strongly to the exact solution of a stochastic differential equation (SDE) with a superlinearly growing and globally one-sided Lipschitz continuous drift coefficient. On the other hand, the implicit Euler scheme is known to converge strongly to the exact solution of such an SDE. Implementations of the implicit Euler scheme, however, require additional computational effort. In this article we therefore propose an explicit and easily implementable numerical method for such an SDE and show that this method converges strongly with the standard order one-half to the exact solution of the SDE. Simulations reveal that this explicit strongly convergent numerical scheme is considerably faster than the implicit Euler scheme.
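The abstract does not state the scheme itself. A minimal sketch, assuming the drift-tamed Euler update Y_{n+1} = Y_n + h·μ(Y_n)/(1 + h·|μ(Y_n)|) + σ(Y_n)·ΔW_n and an illustrative test equation dX = -X³ dt + dW (the function names and the test equation are hypothetical choices, not taken from the listing), could look as follows. The plain explicit Euler step is included only for comparison.

```python
import math
import random


def simulate(x0, drift, diffusion, T, n_steps, rng, tamed=True):
    """Return the time-T value of one explicit Euler path (tamed or plain)."""
    h = T / n_steps
    y = x0
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(h))  # Brownian increment ~ N(0, h)
        mu = drift(y)
        if tamed:
            # Assumed taming: the drift increment has absolute value below 1.
            drift_inc = h * mu / (1.0 + h * abs(mu))
        else:
            drift_inc = h * mu  # plain explicit Euler
        y = y + drift_inc + diffusion(y) * dw
    return y


def cubic_drift(x):
    # Superlinearly growing, one-sided Lipschitz drift (illustrative choice).
    return -x * x * x


def unit_diffusion(x):
    # Additive noise (illustrative choice).
    return 1.0


if __name__ == "__main__":
    rng = random.Random(0)
    print(simulate(100.0, cubic_drift, unit_diffusion, 1.0, 10, rng, tamed=True))
    print(simulate(100.0, cubic_drift, unit_diffusion, 1.0, 10, rng, tamed=False))
```

Starting far from the origin with a coarse step makes the difference visible: the tamed path moves by less than one unit of drift per step, while the plain Euler iterates overshoot and overflow.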
PR · May 9, 2013
Numerical approximations of stochastic differential equations with non-globally Lipschitz continuous coefficients
Martin Hutzenthaler, Arnulf Jentzen
Many stochastic differential equations (SDEs) in the literature have a superlinearly growing nonlinearity in their drift or diffusion coefficient. Unfortunately, moments of the computationally efficient Euler-Maruyama approximation method diverge for these SDEs in finite time. This article develops a general theory based on rare events for studying integrability properties such as moment bounds for discrete-time stochastic processes. Using this approach, we establish moment bounds for fully and partially drift-implicit Euler methods and for a class of new explicit approximation methods which require only a few more arithmetical operations than the Euler-Maruyama method. These moment bounds are then used to prove strong convergence of the proposed schemes. Finally, we illustrate our results for several SDEs from finance, physics, biology and chemistry.
NA · Feb 22, 2019
Multilevel Picard iterations for solving smooth semilinear parabolic heat equations
Weinan E, Martin Hutzenthaler, Arnulf Jentzen et al.
We introduce a new family of numerical algorithms for approximating solutions of general high-dimensional semilinear parabolic partial differential equations at single space-time points. The algorithm is obtained through a delicate combination of the Feynman-Kac and the Bismut-Elworthy-Li formulas, and an approximate decomposition of the Picard fixed-point iteration with multilevel accuracy. The algorithm has been tested on a variety of semilinear partial differential equations that arise in physics and finance, with very satisfactory results. Analytical tools needed for the analysis of such algorithms, including a semilinear Feynman-Kac formula, a new class of semi-norms and their recursive inequalities, are also introduced. They allow us to prove for semilinear heat equations with gradient-independent nonlinearity that the computational complexity of the proposed algorithm is bounded by $O(d\,\varepsilon^{-(4+\delta)})$ for any $\delta\in (0,\infty)$ under suitable assumptions, where $d\in \mathbb{N}$ is the dimensionality of the problem and $\varepsilon\in(0,\infty)$ is the prescribed accuracy.
PR · Sep 10, 2013
Divergence of the multilevel Monte Carlo Euler method for nonlinear stochastic differential equations
Martin Hutzenthaler, Arnulf Jentzen, Peter E. Kloeden
The Euler-Maruyama scheme is known to diverge strongly and numerically weakly when applied to nonlinear stochastic differential equations (SDEs) with superlinearly growing and globally one-sided Lipschitz continuous drift coefficients. Classical Monte Carlo simulations do, however, not suffer from this divergence behavior of Euler's method because this divergence behavior happens on rare events. Indeed, for such nonlinear SDEs the classical Monte Carlo Euler method has been shown to converge by exploiting that the Euler approximations diverge only on events whose probabilities decay to zero very rapidly. Significantly more efficient than the classical Monte Carlo Euler method is the recently introduced multilevel Monte Carlo Euler method. The main observation of this article is that this multilevel Monte Carlo Euler method does - in contrast to classical Monte Carlo methods - not converge in general in the case of such nonlinear SDEs. More precisely, we establish divergence of the multilevel Monte Carlo Euler method for a family of SDEs with superlinearly growing and globally one-sided Lipschitz continuous drift coefficients. In particular, the multilevel Monte Carlo Euler method diverges for these nonlinear SDEs on an event that is not at all rare but has probability one. As a consequence for applications, we recommend not to use the multilevel Monte Carlo Euler method for SDEs with superlinearly growing nonlinearities. Instead we propose to combine the multilevel Monte Carlo method with a slightly modified Euler method. More precisely, we show that the multilevel Monte Carlo method combined with a tamed Euler method converges for nonlinear SDEs with globally one-sided Lipschitz continuous drift coefficients and preserves its strikingly higher order convergence rate from the Lipschitz case.
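The recommended combination of multilevel Monte Carlo with a tamed Euler method can be sketched as below. This is a rough illustration under stated assumptions, not the construction analyzed in the article: the taming formula, the refinement factor 2, the sample-size rule `n_base // 2**level`, and all function names are hypothetical choices.

```python
import math
import random


def tamed_step(y, h, dw, drift, diffusion):
    # Assumed drift-tamed Euler step: drift increment is bounded by 1.
    mu = drift(y)
    return y + h * mu / (1.0 + h * abs(mu)) + diffusion(y) * dw


def coupled_sample(level, x0, T, drift, diffusion, f, rng):
    """One sample of f(fine) - f(coarse); at level 0 just f(one-step path)."""
    if level == 0:
        dw = rng.gauss(0.0, math.sqrt(T))
        return f(tamed_step(x0, T, dw, drift, diffusion))
    n_coarse = 2 ** (level - 1)
    h_fine = T / (2 * n_coarse)
    y_fine = y_coarse = x0
    for _ in range(n_coarse):
        dw1 = rng.gauss(0.0, math.sqrt(h_fine))
        dw2 = rng.gauss(0.0, math.sqrt(h_fine))
        y_fine = tamed_step(y_fine, h_fine, dw1, drift, diffusion)
        y_fine = tamed_step(y_fine, h_fine, dw2, drift, diffusion)
        # The coarse path reuses the same Brownian increments, summed.
        y_coarse = tamed_step(y_coarse, 2.0 * h_fine, dw1 + dw2, drift, diffusion)
    return f(y_fine) - f(y_coarse)


def mlmc_estimate(max_level, n_base, x0, T, drift, diffusion, f, rng):
    """Telescoping sum of level-wise sample means (fewer samples on finer levels)."""
    total = 0.0
    for level in range(max_level + 1):
        n_samples = max(1, n_base // 2 ** level)
        acc = 0.0
        for _ in range(n_samples):
            acc += coupled_sample(level, x0, T, drift, diffusion, f, rng)
        total += acc / n_samples
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    est = mlmc_estimate(5, 4096, 0.0, 1.0,
                        lambda x: -x * x * x, lambda x: 1.0,
                        lambda x: x * x, rng)
    print(est)
```

The fine and coarse paths on each level share their Brownian increments, which is what keeps the variance of the level corrections small.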
LG · Feb 10
Physics-informed diffusion models in spectral space
Davide Gallon, Philippe von Wurstemberger, Patrick Cheridito et al.
We propose a methodology that combines generative latent diffusion models with physics-informed machine learning to generate solutions of parametric partial differential equations (PDEs) conditioned on partial observations, which includes, in particular, forward and inverse PDE problems. We learn the joint distribution of PDE parameters and solutions via a diffusion process in a latent space of scaled spectral representations, where Gaussian noise corresponds to functions with controlled regularity. This spectral formulation enables significant dimensionality reduction compared to grid-based diffusion models and ensures that the induced process in function space remains within a class of functions for which the PDE operators are well defined. Building on diffusion posterior sampling, we enforce physics-informed constraints and measurement conditions during inference, applying Adam-based updates at each diffusion step. We evaluate the proposed approach on Poisson, Helmholtz, and incompressible Navier--Stokes equations, demonstrating improved accuracy and computational efficiency compared with existing diffusion-based PDE solvers, which are state of the art for sparse observations. Code is available at https://github.com/deeplearningmethods/PISD.
PR · Oct 31, 2012
Efficient simulation of nonlinear parabolic SPDEs with additive noise
Arnulf Jentzen, Peter Kloeden, Georg Winkel
Recently, in a paper by Jentzen and Kloeden [Proc. R. Soc. Lond. Ser. A Math. Phys. Eng. Sci. 465 (2009) 649-667], a new method for simulating nearly linear stochastic partial differential equations (SPDEs) with additive noise has been introduced. The key idea was to use suitable linear functionals of the noise process in the numerical scheme which allow a higher approximation order to be obtained. Following this approach, a new simplified version of the scheme in the above named reference is proposed and analyzed in this article. The main advantage of the convergence result given here is the higher convergence order for nonlinear parabolic SPDEs with additive noise, although the used numerical scheme is very simple to simulate and implement.
PR · Nov 7, 2017
Strong convergence rates for explicit space-time discrete numerical approximations of stochastic Allen-Cahn equations
Sebastian Becker, Benjamin Gess, Arnulf Jentzen et al.
The scientific literature contains a number of numerical approximation results for stochastic partial differential equations (SPDEs) with superlinearly growing nonlinearities but, to the best of our knowledge, none of them prove strong or weak convergence rates for full-discrete numerical approximations of space-time white noise driven SPDEs with superlinearly growing nonlinearities. In particular, in the scientific literature there exists neither a result which proves strong convergence rates nor a result which proves weak convergence rates for full-discrete numerical approximations of stochastic Allen-Cahn equations. In this article we bridge this gap and establish strong convergence rates for full-discrete numerical approximations of space-time white noise driven SPDEs with superlinearly growing nonlinearities such as stochastic Allen-Cahn equations. Moreover, we also establish lower bounds for strong temporal and spatial approximation errors which demonstrate that our strong convergence rates are essentially sharp and can, in general, not be improved.
PR · Jan 21, 2016
Strong convergence rates for nonlinearity-truncated Euler-type approximations of stochastic Ginzburg-Landau equations
Sebastian Becker, Arnulf Jentzen
This article proposes and analyzes explicit and easily implementable temporal numerical approximation schemes for additive noise-driven stochastic partial differential equations (SPDEs) with polynomial nonlinearities such as, e.g., stochastic Ginzburg-Landau equations. We prove essentially sharp strong convergence rates for the considered approximation schemes. Our analysis is carried out for abstract stochastic evolution equations on separable Banach and Hilbert spaces including the above mentioned SPDEs as special cases. We also illustrate our strong convergence rate results by means of a numerical simulation in Matlab.
NA · Nov 20, 2016
Exponential integrability properties of numerical approximation processes for nonlinear stochastic differential equations
Martin Hutzenthaler, Arnulf Jentzen, Xiaojie Wang
Exponential integrability properties of numerical approximations are a key tool for establishing positive rates of strong and numerically weak convergence for a large class of nonlinear stochastic differential equations. It turns out that well-known numerical approximation processes such as Euler-Maruyama approximations, linear-implicit Euler approximations, and some tamed Euler approximations from the literature rarely preserve exponential integrability properties of the exact solution. The main contribution of this article is to identify a class of stopped increment-tamed Euler approximations which preserve exponential integrability properties of the exact solution under minor additional assumptions on the involved functions.
NA · Mar 14, 2019
Overcoming the curse of dimensionality in the approximative pricing of financial derivatives with default risks
Martin Hutzenthaler, Arnulf Jentzen, Philippe von Wurstemberger
Parabolic partial differential equations (PDEs) are widely used in the mathematical modeling of natural phenomena and man-made complex systems. In particular, parabolic PDEs are a fundamental tool to determine fair prices of financial derivatives in the financial industry. The PDEs appearing in financial engineering applications are often nonlinear and high dimensional since the dimension typically corresponds to the number of considered financial assets. A major issue is that most approximation methods for nonlinear PDEs in the literature suffer from the so-called curse of dimensionality in the sense that the computational effort to compute an approximation with a prescribed accuracy grows exponentially in the dimension of the PDE or in the reciprocal of the prescribed approximation accuracy, and nearly all approximation methods have not been shown not to suffer from the curse of dimensionality. Recently, a new class of approximation schemes for semilinear parabolic PDEs, termed full history recursive multilevel Picard (MLP) algorithms, was introduced and it was proven that MLP algorithms do overcome the curse of dimensionality for semilinear heat equations. In this paper we extend those findings to a more general class of semilinear PDEs including as special cases semilinear Black-Scholes equations used for the pricing of financial derivatives with default risks. More specifically, we introduce an MLP algorithm for the approximation of solutions of semilinear Black-Scholes equations and prove that the computational effort of our method grows at most polynomially both in the dimension and the reciprocal of the prescribed approximation accuracy. This is, to the best of our knowledge, the first result showing that the approximation of solutions of semilinear Black-Scholes equations is a polynomially tractable approximation problem.
NA · Jan 29, 2018
Strong error analysis for stochastic gradient descent optimization algorithms
Arnulf Jentzen, Benno Kuckuck, Ariel Neufeld et al.
Stochastic gradient descent (SGD) optimization algorithms are key ingredients in a series of machine learning applications. In this article we perform a rigorous strong error analysis for SGD optimization algorithms. In particular, we prove for every arbitrarily small $\varepsilon \in (0,\infty)$ and every arbitrarily large $p\in (0,\infty)$ that the considered SGD optimization algorithm converges in the strong $L^p$-sense with order $\frac{1}{2}-\varepsilon$ to the global minimum of the objective function of the considered stochastic approximation problem under standard convexity-type assumptions on the objective function and relaxed assumptions on the moments of the stochastic errors appearing in the employed SGD optimization algorithm. The key ideas in our convergence proof are, first, to employ techniques from the theory of Lyapunov-type functions for dynamical systems to develop a general convergence machinery for SGD optimization algorithms based on such functions, then, to apply this general machinery to concrete Lyapunov-type functions with polynomial structures, and, thereafter, to perform an induction argument along the powers appearing in the Lyapunov-type functions in order to achieve for every arbitrarily large $ p \in (0,\infty) $ strong $ L^p $-convergence rates. This article also contains an extensive review of results on SGD optimization algorithms in the scientific literature.
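As a toy instance of the kind of stochastic approximation problem described above, the sketch below runs SGD on the strongly convex objective F(θ) = ½·E[(θ − Z)²], whose global minimum is E[Z]. The objective, the learning-rate schedule γ_n = c/n, and the function name are illustrative assumptions and not taken from the article. With c = 1 the iterate after n steps coincides with the sample mean of the first n samples, whose root-mean-square error decays like n^(-1/2).

```python
import random


def sgd_quadratic(theta0, samples, c=1.0):
    """SGD for F(theta) = 0.5 * E[(theta - Z)^2] with learning rates c / n."""
    theta = theta0
    for n, z in enumerate(samples, start=1):
        grad = theta - z          # unbiased estimate of F'(theta)
        theta -= (c / n) * grad   # decreasing learning rate (assumed schedule)
    return theta


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [rng.gauss(2.0, 1.0) for _ in range(10000)]
    # The result should be close to the minimizer E[Z] = 2.0.
    print(sgd_quadratic(5.0, samples))
```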
NA · Apr 28
Deep neural network approximation theory for high-dimensional functions
Pierfrancesco Beneventano, Patrick Cheridito, Robin Graeber et al.
The purpose of this article is to develop a machinery to study the capacity of deep neural networks (DNNs) to approximate high-dimensional functions. In particular, we show that DNNs have the expressive power to overcome the curse of dimensionality in the approximation of a large class of functions. More precisely, we prove that these functions can be approximated by DNNs on compact sets such that the number of parameters necessary to represent the approximating DNNs grows at most polynomially in the reciprocal $1/\varepsilon$ of the prescribed approximation error $\varepsilon>0$ and in the input dimension $d\in\mathbb N$. To this end, we introduce certain approximation spaces, consisting of sequences of functions that can be efficiently approximated by DNNs. We then establish closure properties which we combine with known and new bounds on the number of parameters necessary to approximate locally Lipschitz continuous functions, maximum functions, and product functions by DNNs. The main result of this article demonstrates that DNNs have sufficient expressive power to approximate, without the curse of dimensionality, certain sequences of functions which can be constructed by means of a finite number of compositions using locally Lipschitz continuous functions, maxima, and products.
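One building block mentioned above, the maximum function, has a well-known exact ReLU representation: max(x, y) = x + ReLU(y − x). Composing it pairwise gives the maximum of d inputs with a number of layers of order log₂(d). The sketch below only illustrates this identity; the function names are hypothetical, and it does not reproduce the article's parameter-count bounds.

```python
def relu(x):
    return max(x, 0.0)


def max2(x, y):
    # Exact in exact arithmetic: if y > x this is x + (y - x) = y, else x + 0 = x.
    return x + relu(y - x)


def max_tree(values):
    """Maximum of a non-empty sequence via a binary tree of max2 units."""
    layer = list(values)
    while len(layer) > 1:
        nxt = [max2(layer[i], layer[i + 1]) for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2 == 1:
            nxt.append(layer[-1])  # odd element is passed through unchanged
        layer = nxt
    return layer[0]


if __name__ == "__main__":
    print(max_tree([3.0, -1.0, 7.0, 2.0, 5.0]))
```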
NA · Oct 19, 2017
Strong convergence for explicit space-time discrete numerical approximation methods for stochastic Burgers equations
Arnulf Jentzen, Diyora Salimova, Timo Welti
In this paper we propose and analyze explicit space-time discrete numerical approximations for additive space-time white noise driven stochastic partial differential equations (SPDEs) with non-globally monotone nonlinearities such as the stochastic Burgers equation with space-time white noise. The main result of this paper proves that the proposed explicit space-time discrete approximation method converges strongly to the solution process of the stochastic Burgers equation with space-time white noise. To the best of our knowledge, the main result of this work is the first result in the literature which establishes strong convergence for a space-time discrete approximation method in the case of the stochastic Burgers equations with space-time white noise.
NA · Nov 6, 2017
On arbitrarily slow convergence rates for strong numerical approximations of Cox-Ingersoll-Ross processes and squared Bessel processes
Mario Hefter, Arnulf Jentzen
Cox-Ingersoll-Ross (CIR) processes are extensively used in state-of-the-art models for the approximative pricing of financial derivatives. In particular, CIR processes are employed day after day to model instantaneous variances (squared volatilities) of foreign exchange rates and stock prices in Heston-type models and they are also intensively used to model short-rate interest rates. The prices of the financial derivatives in the above mentioned models are very often approximately computed by means of explicit or implicit Euler- or Milstein-type discretization methods based on equidistant evaluations of the driving noise processes. In this article we study the strong convergence speeds of all such discretization methods. More specifically, the main result of this article reveals that each such discretization method achieves at most a strong convergence order of $\delta/2$, where $0<\delta<2$ is the dimension of the squared Bessel process associated to the considered CIR process. In particular, we thereby reveal that discretization methods currently employed in the financial industry may converge with arbitrarily slow strong convergence rates to the solution of the considered CIR process. We thereby lay open the need for the development of other more sophisticated approximation methods which are capable of solving CIR processes in the strong sense in a reasonable computational time and which thus cannot belong to the class of algorithms which use equidistant evaluations of the driving noise processes.
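To make the quantity δ concrete, the sketch below assumes the common parametrization dX = κ(θ − X) dt + σ√X dW, for which the associated squared Bessel dimension is δ = 4κθ/σ². The abstract does not fix a parametrization, so both the formula and the parameter names are assumptions here. The discretization shown is a truncated explicit Euler scheme on an equidistant grid, chosen only as one representative of the class of methods the result covers.

```python
import math
import random


def bessel_dimension(kappa, theta, sigma):
    # delta = 4 * kappa * theta / sigma^2 under the assumed parametrization.
    return 4.0 * kappa * theta / sigma ** 2


def cir_truncated_euler(x0, kappa, theta, sigma, T, n_steps, rng):
    """Truncated explicit Euler for CIR on an equidistant grid (illustrative)."""
    h = T / n_steps
    x = x0
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(h))
        xp = max(x, 0.0)  # truncate so the square root stays defined
        x = x + kappa * (theta - xp) * h + sigma * math.sqrt(xp) * dw
    return max(x, 0.0)


if __name__ == "__main__":
    kappa, theta, sigma = 0.5, 0.04, 0.4
    delta = bessel_dimension(kappa, theta, sigma)
    # For 0 < delta < 2 the result bounds the strong order by delta / 2.
    print(delta, delta / 2.0)
    print(cir_truncated_euler(0.04, kappa, theta, sigma, 1.0, 100, random.Random(0)))
```

Small κθ relative to σ² pushes δ toward zero, which is the regime in which the stated upper bound δ/2 on the strong order becomes arbitrarily small.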
NA · Mar 14, 2019
Strong and weak divergence of exponential and linear-implicit Euler approximations for stochastic partial differential equations with superlinearly growing nonlinearities
Matteo Beccari, Martin Hutzenthaler, Arnulf Jentzen et al.
The explicit Euler scheme and similar explicit approximation schemes (such as the Milstein scheme) are known to diverge strongly and numerically weakly in the case of one-dimensional stochastic ordinary differential equations with superlinearly growing nonlinearities. It remained an open question whether such a divergence phenomenon also holds in the case of stochastic partial differential equations with superlinearly growing nonlinearities such as stochastic Allen-Cahn equations. In this work we solve this problem by proving that full-discrete exponential Euler and full-discrete linear-implicit Euler approximations diverge strongly and numerically weakly in the case of stochastic Allen-Cahn equations. This article also contains a short literature overview on existing numerical approximation results for stochastic differential equations with superlinearly growing nonlinearities.
LG · Oct 31, 2023
Mathematical Introduction to Deep Learning: Methods, Implementations, and Theory
Arnulf Jentzen, Benno Kuckuck, Philippe von Wurstemberger
This book aims to provide an introduction to the topic of deep learning algorithms. We review essential components of deep learning algorithms in full mathematical detail including different artificial neural network (ANN) architectures (such as fully-connected feedforward ANNs, convolutional ANNs, recurrent ANNs, residual ANNs, and ANNs with batch normalization) and different optimization algorithms (such as the basic stochastic gradient descent (SGD) method, accelerated methods, and adaptive methods). We also cover several theoretical aspects of deep learning algorithms such as approximation capacities of ANNs (including a calculus for ANNs), optimization theory (including Kurdyka-Łojasiewicz inequalities), and generalization errors. In the last part of the book some deep learning approximation methods for PDEs are reviewed including physics-informed neural networks (PINNs) and deep Galerkin methods. We hope that this book will be useful for students and scientists who do not yet have any background in deep learning at all and would like to gain a solid foundation as well as for practitioners who would like to obtain a firmer mathematical understanding of the objects and methods considered in deep learning.
NA · May 7, 2022
Deep learning approximations for non-local nonlinear PDEs with Neumann boundary conditions
Victor Boussange, Sebastian Becker, Arnulf Jentzen et al.
Nonlinear partial differential equations (PDEs) are used to model dynamical processes in a large number of scientific fields, ranging from finance to biology. In many applications standard local models are not sufficient to accurately account for certain non-local phenomena such as, e.g., interactions at a distance. In order to properly capture these phenomena non-local nonlinear PDE models are frequently employed in the literature. In this article we propose two numerical methods based on machine learning and on Picard iterations, respectively, to approximately solve non-local nonlinear PDEs. The proposed machine learning-based method is an extended variant of a deep learning-based splitting-up type approximation method previously introduced in the literature and utilizes neural networks to provide approximate solutions on a subset of the spatial domain of the solution. The Picard iterations-based method is an extended variant of the so-called full history recursive multilevel Picard approximation scheme previously introduced in the literature and provides an approximate solution for a single point of the domain. Both methods are mesh-free and allow non-local nonlinear PDEs with Neumann boundary conditions to be solved in high dimensions. In the two methods, the numerical difficulties arising due to the dimensionality of the PDEs are avoided by (i) using the correspondence between the expected trajectory of reflected stochastic processes and the solution of PDEs (given by the Feynman-Kac formula) and by (ii) using a plain vanilla Monte Carlo integration to handle the non-local term. We evaluate the performance of the two methods on five different PDEs arising in physics and biology. In all cases, the methods yield good results in up to 10 dimensions with short run times. Our work extends recently developed methods to overcome the curse of dimensionality in solving PDEs.
PR · Dec 9, 2016
Weak convergence rates for numerical approximations of stochastic partial differential equations with nonlinear diffusion coefficients in UMD Banach spaces
Mario Hefter, Arnulf Jentzen, Ryan Kurniawan
Strong convergence rates for numerical approximations of semilinear stochastic partial differential equations (SPDEs) with smooth and regular nonlinearities are well understood in the literature. Weak convergence rates for numerical approximations of such SPDEs have been investigated for about two decades and are still not yet fully understood. In particular, no essentially sharp weak convergence rates are known for temporal or spatial numerical approximations of space-time white noise driven SPDEs with nonlinear multiplication operators in the diffusion coefficients. In this article we overcome this problem by establishing essentially sharp weak convergence rates for exponential Euler approximations of semilinear SPDEs with nonlinear multiplication operators in the diffusion coefficients. Key ingredients of our approach are applications of the mild Itô type formula in UMD Banach spaces with type 2.
OC · Jul 29, 2024
Convergence rates for the Adam optimizer
Steffen Dereich, Arnulf Jentzen
Stochastic gradient descent (SGD) optimization methods are nowadays the method of choice for the training of deep neural networks (DNNs) in artificial intelligence systems. In practically relevant training problems, usually not the plain vanilla standard SGD method is the employed optimization scheme but instead suitably accelerated and adaptive SGD optimization methods are applied. As of today, maybe the most popular variant of such accelerated and adaptive SGD optimization methods is the famous Adam optimizer proposed by Kingma & Ba in 2014. Despite the popularity of the Adam optimizer in implementations, it remained an open problem of research to provide a convergence analysis for the Adam optimizer even in the situation of simple quadratic stochastic optimization problems where the objective function (the function one intends to minimize) is strongly convex. In this work we solve this problem by establishing optimal convergence rates for the Adam optimizer for a large class of stochastic optimization problems, in particular, covering simple quadratic stochastic optimization problems. The key ingredient of our convergence analysis is a new vector field function which we propose to refer to as the Adam vector field. This Adam vector field accurately describes the macroscopic behaviour of the Adam optimization process but differs from the negative gradient of the objective function (the function we intend to minimize) of the considered stochastic optimization problem. In particular, our convergence analysis reveals that the Adam optimizer does typically not converge to critical points of the objective function (zeros of the gradient of the objective function) of the considered optimization problem but converges with rates to zeros of this Adam vector field.
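For orientation, the standard Adam update of Kingma & Ba (exponential moving averages of the gradient and its square, bias correction, and a normalized step) can be written as below and applied to a simple quadratic stochastic problem with gradient estimate θ − Z. The test problem, the default hyperparameters, and the function names are illustrative assumptions. The sketch does not compute the Adam vector field introduced in the article.

```python
import math
import random


def adam_step(theta, m, v, t, grad, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar Adam update; t is the 1-based step counter."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment average
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment average
    m_hat = m / (1.0 - beta1 ** t)                # bias corrections
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


def run_adam(theta0, samples, lr=0.01):
    """Adam on F(theta) = 0.5 * E[(theta - Z)^2] with gradient estimate theta - z."""
    theta, m, v = theta0, 0.0, 0.0
    for t, z in enumerate(samples, start=1):
        theta, m, v = adam_step(theta, m, v, t, theta - z, lr=lr)
    return theta


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [rng.gauss(2.0, 1.0) for _ in range(5000)]
    print(run_adam(5.0, samples))
```

Because the step is normalized by the square root of the second-moment estimate, the very first update has size close to `lr` whatever the gradient scale. This normalization is one way in which Adam's dynamics differ from following the negative gradient.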
NA · Apr 30
Algorithmically Designed Artificial Neural Networks (ADANNs): Higher order deep operator learning for parametric partial differential equations
Arnulf Jentzen, Adrian Riekert, Philippe von Wurstemberger
In this article we propose a new deep learning approach to approximate operators related to parametric partial differential equations (PDEs). In particular, we introduce a new strategy to design specific artificial neural network (ANN) architectures in conjunction with specific ANN initialization schemes which are tailor-made for the particular approximation problem under consideration. In the proposed approach we combine efficient classical numerical approximation techniques with deep operator learning methodologies. Specifically, we introduce customized adaptions of existing ANN architectures together with specialized initializations for these ANN architectures so that at initialization we have that the ANNs closely mimic a chosen efficient classical numerical algorithm for the considered approximation problem. The obtained ANN architectures and their initialization schemes are thus strongly inspired by numerical algorithms as well as by popular deep learning methodologies from the literature and in that sense we refer to the introduced ANNs in conjunction with their tailor-made initialization schemes as Algorithmically Designed Artificial Neural Networks (ADANNs). We numerically test the proposed ADANN methodology in the case of several parametric PDEs. In the tested numerical examples the ADANN methodology significantly outperforms existing classical approximation algorithms as well as existing deep operator learning methodologies from the literature.
PR · Jan 16, 2017
Lower bounds for weak approximation errors for spatial spectral Galerkin approximations of stochastic wave equations
Ladislas Jacobe de Naurois, Arnulf Jentzen, Timo Welti
Although for a number of semilinear stochastic wave equations existence and uniqueness results for corresponding solution processes are known from the literature, these solution processes are typically not explicitly known and numerical approximation methods are needed in order for mathematical modelling with stochastic wave equations to become relevant for real world applications. This, in turn, requires the numerical analysis of convergence rates for such numerical approximation processes. A recent article by the authors proves upper bounds for weak errors for spatial spectral Galerkin approximations of a class of semilinear stochastic wave equations. The findings there are complemented by the main result of this work, that provides lower bounds for weak errors which show that in the general framework considered the established upper bounds can essentially not be improved.
LG · Aug 3, 2022
Gradient descent provably escapes saddle points in the training of shallow ReLU networks
Patrick Cheridito, Arnulf Jentzen, Florian Rossmannek
Dynamical systems theory has recently been applied in optimization to prove that gradient descent algorithms bypass so-called strict saddle points of the loss function. However, in many modern machine learning applications, the required regularity conditions are not satisfied. In this paper, we prove a variant of the relevant dynamical systems result, a center-stable manifold theorem, in which we relax some of the regularity requirements. We explore its relevance for various machine learning tasks, with a particular focus on shallow rectified linear unit (ReLU) and leaky ReLU networks with scalar input. Building on a detailed examination of critical points of the square integral loss function for shallow ReLU and leaky ReLU networks relative to an affine target function, we show that gradient descent circumvents most saddle points. Furthermore, we prove convergence to global minima under favourable initialization conditions, quantified by an explicit threshold on the limiting loss.
OC · Feb 28, 2023
On the existence of minimizers in shallow residual ReLU neural network optimization landscapes
Steffen Dereich, Arnulf Jentzen, Sebastian Kassing
In this article, we show existence of minimizers in the loss landscape for residual artificial neural networks (ANNs) with multi-dimensional input layer and one hidden layer with ReLU activation. Our work contrasts earlier results in [D. Gallon, A. Jentzen, and F. Lindner, preprint, arXiv:2211.15641, 2022] and [P. Petersen, M. Raslan, and F. Voigtlaender, Found. Comput. Math., 21 (2021), pp. 375-444] which showed that in many situations minimizers do not exist for common smooth activation functions even in the case where the target functions are polynomials. The proof of the existence property makes use of a closure of the search space containing all functions generated by ANNs and additional discontinuous generalized responses. As we will show, the additional generalized responses in this larger space are suboptimal so that the minimum is attained in the original function class.
LG · Jul 11, 2024
Non-convergence of Adam and other adaptive stochastic gradient descent optimization methods for non-vanishing learning rates
Steffen Dereich, Robin Graeber, Arnulf Jentzen
Deep learning algorithms - typically consisting of a class of deep neural networks trained by a stochastic gradient descent (SGD) optimization method - are nowadays the key ingredients in many artificial intelligence (AI) systems and have revolutionized our ways of working and living in modern societies. For example, SGD methods are used to train powerful large language models (LLMs) such as versions of ChatGPT and Gemini, SGD methods are employed to create successful generative AI based text-to-image creation models such as Midjourney, DALL-E, and Stable Diffusion, but SGD methods are also used to train DNNs to approximately solve scientific models such as partial differential equation (PDE) models from physics and biology and optimal control and stopping problems from engineering. It is known that the plain vanilla standard SGD method fails to converge even in the situation of several convex optimization problems if the learning rates are bounded away from zero. However, in many practically relevant training scenarios, often not the plain vanilla standard SGD method but instead adaptive SGD methods such as the RMSprop and the Adam optimizers, in which the learning rates are modified adaptively during the training process, are employed. This naturally raises the question whether such adaptive optimizers, in which the learning rates are modified adaptively during the training process, do converge in the situation of non-vanishing learning rates. In this work we answer this question negatively by proving that adaptive SGD methods such as the popular Adam optimizer fail to converge to any possible random limit point if the learning rates are asymptotically bounded away from zero. In our proof of this non-convergence result we establish suitable pathwise a priori bounds for a class of accelerated and adaptive SGD methods, which are also of independent interest.
LGMar 19
Uniform a priori bounds and error analysis for the Adam stochastic gradient descent optimization methodSteffen Dereich, Thang Do, Arnulf Jentzen
The adaptive moment estimation (Adam) optimizer proposed by Kingma & Ba (2014) is presumably the most popular stochastic gradient descent (SGD) optimization method for the training of deep neural networks (DNNs) in artificial intelligence (AI) systems. Despite its groundbreaking success in the training of AI systems, it still remains an open research problem to provide a complete error analysis of Adam, not only for optimizing DNNs but even when applied to strongly convex stochastic optimization problems (SOPs). Previous error analysis results for strongly convex SOPs in the literature provide conditional convergence analyses that rely on the assumption that Adam does not diverge to infinity but remains uniformly bounded. It is the key contribution of this work to establish uniform a priori bounds for Adam and, thereby, to provide -- for the first time -- an unconditional error analysis for Adam for a large class of strongly convex SOPs.
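For readers unfamiliar with the method, a minimal sketch of a single Adam step in the form given by Kingma & Ba (2014) may help. It is written for a scalar parameter to stay short, which is a simplifying assumption; the function name `adam_step` is hypothetical, and the default hyperparameters are the ones suggested in the original Adam paper.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter (illustrative sketch).

    theta: current parameter, grad: stochastic gradient at theta,
    m, v: first and second moment estimates, t: step counter starting at 1.
    """
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction for the zero initialization of m and v.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Gradient step rescaled by the root of the second moment estimate.
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v
```

Because the update divides by the root of the second moment estimate, the first step has size close to `lr` regardless of the gradient's magnitude.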
NAJan 19, 2023
The necessity of depth for artificial neural networks to approximate certain classes of smooth and bounded functions without the curse of dimensionalityLukas Gonon, Robin Graeber, Arnulf Jentzen
In this article we study high-dimensional approximation capacities of shallow and deep artificial neural networks (ANNs) with the rectified linear unit (ReLU) activation. In particular, it is a key contribution of this work to reveal that for all $a,b\in\mathbb{R}$ with $b-a\geq 7$ we have that the functions $[a,b]^d\ni x=(x_1,\dots,x_d)\mapsto\prod_{i=1}^d x_i\in\mathbb{R}$ for $d\in\mathbb{N}$ as well as the functions $[a,b]^d\ni x =(x_1,\dots, x_d)\mapsto\sin(\prod_{i=1}^d x_i) \in \mathbb{R} $ for $ d \in \mathbb{N} $ can neither be approximated without the curse of dimensionality by means of shallow ANNs nor insufficiently deep ANNs with ReLU activation but can be approximated without the curse of dimensionality by sufficiently deep ANNs with ReLU activation. We show that the product functions and the sine of the product functions are polynomially tractable approximation problems among the approximating class of deep ReLU ANNs with the number of hidden layers being allowed to grow in the dimension $ d \in \mathbb{N} $. We establish the above outlined statements not only for the product functions and the sine of the product functions but also for other classes of target functions, in particular, for classes of uniformly globally bounded $ C^{ \infty } $-functions with compact support on any $[a,b]^d$ with $a\in\mathbb{R}$, $b\in(a,\infty)$. Roughly speaking, in this work we lay open that simple approximation problems such as approximating the sine or cosine of products cannot be solved in standard implementation frameworks by shallow or insufficiently deep ANNs with ReLU activation in polynomial time, but can be approximated by sufficiently deep ReLU ANNs with the number of parameters growing at most polynomially.
OCNov 10, 2025
Adam symmetry theorem: characterization of the convergence of the stochastic Adam optimizerSteffen Dereich, Thang Do, Arnulf Jentzen et al.
Besides the standard stochastic gradient descent (SGD) method, the Adam optimizer due to Kingma & Ba (2014) is currently probably the best-known optimization method for the training of deep neural networks in artificial intelligence (AI) systems. Despite the popularity and the success of Adam it remains an open research problem to provide a rigorous convergence analysis for Adam even for the class of strongly convex stochastic optimization problems (SOPs). In one of the main results of this work we establish convergence rates for Adam in terms of the number of gradient steps (convergence rate $1/2$ w.r.t. the size of the learning rate), the size of the mini-batches (convergence rate 1 w.r.t. the size of the mini-batches), and the size of the second moment parameter of Adam (convergence rate 1 w.r.t. the distance of the second moment parameter to 1) for the class of strongly convex SOPs. In a further main result of this work, which we refer to as Adam symmetry theorem, we illustrate the optimality of the established convergence rates by proving for a special class of simple quadratic strongly convex SOPs that Adam converges as the number of gradient steps increases to infinity to the solution of the SOP (the unique minimizer of the strongly convex objective function) if and only if the random variables in the SOP (the data in the SOP) are symmetrically distributed. In particular, in the standard case where the random variables in the SOP are not symmetrically distributed we disprove that Adam converges to the minimizer of the SOP as the number of Adam steps increases to infinity. We also complement the conclusions of our convergence analysis and the Adam symmetry theorem by several numerical simulations that indicate the sharpness of the established convergence rates and that illustrate the practical appearance of the phenomena revealed in the Adam symmetry theorem.
OCNov 6, 2025
ODE approximation for the Adam algorithm: General and overparametrized settingSteffen Dereich, Arnulf Jentzen, Sebastian Kassing
The Adam optimizer is currently presumably the most popular optimization method in deep learning. In this article we develop an ODE based method to study the Adam optimizer in a fast-slow scaling regime. For fixed momentum parameters and vanishing step-sizes, we show that the Adam algorithm is an asymptotic pseudo-trajectory of the flow of a particular vector field, which is referred to as the Adam vector field. Leveraging properties of asymptotic pseudo-trajectories, we establish convergence results for the Adam algorithm. In particular, in a very general setting we show that if the Adam algorithm converges, then the limit must be a zero of the Adam vector field, rather than a local minimizer or critical point of the objective function. In contrast, in the overparametrized empirical risk minimization setting, the Adam algorithm is able to locally find the set of minima. Specifically, we show that in a neighborhood of the global minima, the objective function serves as a Lyapunov function for the flow induced by the Adam vector field. As a consequence, if the Adam algorithm enters a neighborhood of the global minima infinitely often, it converges to the set of global minima.
NASep 24, 2023
Deep neural networks with ReLU, leaky ReLU, and softplus activation provably overcome the curse of dimensionality for Kolmogorov partial differential equations with Lipschitz nonlinearities in the $L^p$-senseJulia Ackermann, Arnulf Jentzen, Thomas Kruse et al.
Recently, several deep learning (DL) methods for approximating high-dimensional partial differential equations (PDEs) have been proposed. The interest that these methods have generated in the literature is in large part due to simulations which appear to demonstrate that such DL methods have the capacity to overcome the curse of dimensionality (COD) for PDEs in the sense that the number of computational operations they require to achieve a certain approximation accuracy $\varepsilon\in(0,\infty)$ grows at most polynomially in the PDE dimension $d\in\mathbb N$ and the reciprocal of $\varepsilon$. While there is thus far no mathematical result that proves that one of such methods is indeed capable of overcoming the COD, there are now a number of rigorous results in the literature that show that deep neural networks (DNNs) have the expressive power to approximate PDE solutions without the COD in the sense that the number of parameters used to describe the approximating DNN grows at most polynomially in both the PDE dimension $d\in\mathbb N$ and the reciprocal of the approximation accuracy $\varepsilon>0$. Roughly speaking, in the literature it has been proved for every $T>0$ that solutions $u_d\colon [0,T]\times\mathbb R^d\to \mathbb R$, $d\in\mathbb N$, of semilinear heat PDEs with Lipschitz continuous nonlinearities can be approximated by DNNs with ReLU activation at the terminal time in the $L^2$-sense without the COD provided that the initial value functions $\mathbb R^d\ni x\mapsto u_d(0,x)\in\mathbb R$, $d\in\mathbb N$, can be approximated by ReLU DNNs without the COD. It is the key contribution of this work to generalize this result by establishing this statement in the $L^p$-sense with $p\in(0,\infty)$ and by allowing the activation function to be more general covering the ReLU, the leaky ReLU, and the softplus activation functions as special cases.
OCJul 13, 2022
Normalized gradient flow optimization in the training of ReLU artificial neural networksSimon Eberle, Arnulf Jentzen, Adrian Riekert et al.
The training of artificial neural networks (ANNs) is nowadays a highly relevant algorithmic procedure with many applications in science and industry. Roughly speaking, ANNs can be regarded as iterated compositions between affine linear functions and certain fixed nonlinear functions, which are usually multidimensional versions of a one-dimensional so-called activation function. The most popular choice of such a one-dimensional activation function is the rectified linear unit (ReLU) activation function which maps a real number to its positive part $ \mathbb{R} \ni x \mapsto \max\{ x, 0 \} \in \mathbb{R} $. In this article we propose and analyze a modified variant of the standard training procedure of such ReLU ANNs in the sense that we propose to restrict the negative gradient flow dynamics to a large submanifold of the ANN parameter space, which is a strict $ C^{ \infty } $-submanifold of the entire ANN parameter space that seems to enjoy better regularity properties than the entire ANN parameter space but which is also sufficiently large and sufficiently high dimensional so that it can represent all ANN realization functions that can be represented through the entire ANN parameter space. In the special situation of shallow ANNs with just one-dimensional ANN layers we also prove for every Lipschitz continuous target function that every gradient flow trajectory on this large submanifold of the ANN parameter space is globally bounded. For the standard gradient flow on the entire ANN parameter space with Lipschitz continuous target functions it remains an open problem of research to prove or disprove the global boundedness of gradient flow trajectories even in the situation of shallow ANNs with just one-dimensional ANN layers.
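The description of ANNs as compositions of affine functions and a ReLU activation can be made concrete with a small sketch. It assumes the simplest case mentioned in the abstract, a shallow ANN with one-dimensional input and output; the function and parameter names are hypothetical and not taken from the paper.

```python
def relu(x):
    """Rectified linear unit: maps a real number to its positive part."""
    return max(x, 0.0)


def shallow_relu_ann(x, inner_weights, inner_biases, outer_weights, outer_bias):
    """Realization function of a shallow ReLU ANN with scalar input and output.

    Computes outer_bias + sum_i outer_weights[i] * relu(inner_weights[i] * x
    + inner_biases[i]), i.e. affine map, ReLU, affine map.
    """
    hidden = [relu(w * x + b) for w, b in zip(inner_weights, inner_biases)]
    return sum(v * h for v, h in zip(outer_weights, hidden)) + outer_bias
```

For example, two hidden neurons with inner weights `[1.0, -1.0]`, zero biases, and outer weights `[1.0, 1.0]` realize the absolute value function. Rescaling the inner and outer weights of a neuron by reciprocal positive factors leaves this function unchanged, which shows that different parameter vectors can share one realization function.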
LGJun 27, 2022
On bounds for norms of reparameterized ReLU artificial neural network parameters: sums of fractional powers of the Lipschitz norm control the network parameter vectorArnulf Jentzen, Timo Kröger
It is an elementary fact in the scientific literature that the Lipschitz norm of the realization function of a feedforward fully-connected rectified linear unit (ReLU) artificial neural network (ANN) can, up to a multiplicative constant, be bounded from above by sums of powers of the norm of the ANN parameter vector. Roughly speaking, in this work we reveal in the case of shallow ANNs that the converse inequality is also true. More formally, we prove that the norm of the equivalence class of ANN parameter vectors with the same realization function is, up to a multiplicative constant, bounded from above by the sum of powers of the Lipschitz norm of the ANN realization function (with the exponents $ 1/2 $ and $ 1 $). Moreover, we prove that this upper bound only holds when employing the Lipschitz norm but does neither hold for Hölder norms nor for Sobolev-Slobodeckij norms. Furthermore, we prove that this upper bound only holds for sums of powers of the Lipschitz norm with the exponents $ 1/2 $ and $ 1 $ but does not hold for the Lipschitz norm alone.
OCJan 10, 2025Code
Averaged Adam accelerates stochastic optimization in the training of deep neural network approximations for partial differential equation and optimal control problemsSteffen Dereich, Arnulf Jentzen, Adrian Riekert
Deep learning methods - usually consisting of a class of deep neural networks (DNNs) trained by a stochastic gradient descent (SGD) optimization method - are nowadays omnipresent in data-driven learning problems as well as in scientific computing tasks such as optimal control (OC) and partial differential equation (PDE) problems. In practically relevant learning tasks, often not the plain-vanilla standard SGD optimization method is employed to train the considered class of DNNs but instead more sophisticated adaptive and accelerated variants of the standard SGD method such as the popular Adam optimizer are used. Inspired by the classical Polyak-Ruppert averaging approach, in this work we apply averaged variants of the Adam optimizer to train DNNs to approximately solve exemplary scientific computing problems in the form of PDEs and OC problems. We test the averaged variants of Adam in a series of learning problems including physics-informed neural network (PINN), deep backward stochastic differential equation (deep BSDE), and deep Kolmogorov approximations for PDEs (such as heat, Black-Scholes, Burgers, and Allen-Cahn PDEs), including DNN approximations for OC problems, and including DNN approximations for image classification problems (ResNet for CIFAR-10). In each of the numerical examples the employed averaged variants of Adam outperform the standard Adam and the standard SGD optimizers, particularly, in the situation of the scientific machine learning problems. The Python source codes for the numerical experiments associated to this work can be found on GitHub at https://github.com/deeplearningmethods/averaged-adam.
LGFeb 20, 2025Code
On the logical skills of large language models: evaluations using arbitrarily complex first-order logic problemsShokhrukh Ibragimov, Arnulf Jentzen, Benno Kuckuck
We present a method of generating first-order logic statements whose complexity can be controlled along multiple dimensions. We use this method to automatically create several datasets consisting of questions asking for the truth or falsity of first-order logic statements in Zermelo-Fraenkel set theory. While the resolution of these questions does not require any knowledge beyond basic notation of first-order logic and set theory, it does require a degree of planning and logical reasoning, which can be controlled up to arbitrarily high difficulty by the complexity of the generated statements. Furthermore, we do extensive evaluations of the performance of various large language models, including recent models such as DeepSeek-R1 and OpenAI's o3-mini, on these datasets. All of the datasets along with the code used for generating them, as well as all data from the evaluations is publicly available at https://github.com/bkuckuck/logical-skills-of-llms.
LGOct 14, 2024
Non-convergence to global minimizers in data driven supervised deep learning: Adam and stochastic gradient descent optimization provably fail to converge to global minimizers in the training of deep neural networks with ReLU activationThang Do, Sonja Hannibal, Arnulf Jentzen
Deep learning methods - consisting of a class of deep neural networks (DNNs) trained by a stochastic gradient descent (SGD) optimization method - are nowadays key tools to solve data driven supervised learning problems. Despite the great success of SGD methods in the training of DNNs, it remains a fundamental open problem of research to explain the success and the limitations of such methods in rigorous theoretical terms. In particular, even in the standard setup of data driven supervised learning problems, it remained an open research problem to prove (or disprove) that SGD methods converge in the training of DNNs with the popular rectified linear unit (ReLU) activation function with high probability to global minimizers in the optimization landscape. In this work we answer this question negatively. Specifically, in this work we prove for a large class of SGD methods that the considered optimizer does with high probability not converge to global minimizers of the optimization problem. It turns out that the probability to not converge to a global minimizer converges at least exponentially quickly to one as the width of the first hidden layer of the ANN and the depth of the ANN, respectively, increase. The general non-convergence results of this work do not only apply to the plain vanilla standard SGD method but also to a large class of accelerated and adaptive SGD methods such as the momentum SGD, the Nesterov accelerated SGD, the Adagrad, the RMSProp, the Adam, the Adamax, the AMSGrad, and the Nadam optimizers.
OCFeb 7, 2024
Non-convergence to global minimizers for Adam and stochastic gradient descent optimization and constructions of local minimizers in the training of artificial neural networksArnulf Jentzen, Adrian Riekert
Stochastic gradient descent (SGD) optimization methods such as the plain vanilla SGD method and the popular Adam optimizer are nowadays the method of choice in the training of artificial neural networks (ANNs). Despite the remarkable success of SGD methods in the ANN training in numerical simulations, it remains in essentially all practically relevant scenarios an open problem to rigorously explain why SGD methods seem to succeed to train ANNs. In particular, in most practically relevant supervised learning problems, it seems that SGD methods do with high probability not converge to global minimizers in the optimization landscape of the ANN training problem. Nevertheless, it remains an open problem of research to disprove the convergence of SGD methods to global minimizers. In this work we solve this research problem in the situation of shallow ANNs with the rectified linear unit (ReLU) and related activations with the standard mean square error loss by disproving in the training of such ANNs that SGD methods (such as the plain vanilla SGD, the momentum SGD, the AdaGrad, the RMSprop, and the Adam optimizers) can find a global minimizer with high probability. Even stronger, we reveal in the training of such ANNs that SGD methods do with high probability fail to converge to global minimizers in the optimization landscape. The findings of this work do, however, not disprove that SGD methods succeed to train ANNs since they do not exclude the possibility that SGD methods find good local minimizers whose risk values are close to the risk values of the global minimizers. In this context, another key contribution of this work is to establish the existence of a hierarchical structure of local minimizers with distinct risk values in the optimization landscape of ANN training problems with ReLU and related activations.
NAMay 7, 2025
A brief review of the Deep BSDE method for solving high-dimensional partial differential equationsJiequn Han, Arnulf Jentzen, Weinan E
High-dimensional partial differential equations (PDEs) pose significant challenges for numerical computation due to the curse of dimensionality, which limits the applicability of traditional mesh-based methods. Since 2017, the Deep BSDE method has introduced deep learning techniques that enable the effective solution of nonlinear PDEs in very high dimensions. This innovation has sparked considerable interest in using neural networks for high-dimensional PDEs, making it an active area of research. In this short review, we briefly sketch the Deep BSDE method, its subsequent developments, and future directions for the field.
LGMar 3, 2025
Non-convergence to the optimal risk for Adam and stochastic gradient descent optimization in the training of deep neural networksThang Do, Arnulf Jentzen, Adrian Riekert
Despite the omnipresent use of stochastic gradient descent (SGD) optimization methods in the training of deep neural networks (DNNs), it remains, in basically all practically relevant scenarios, a fundamental open problem to provide a rigorous theoretical explanation for the success (and the limitations) of SGD optimization methods in deep learning. In particular, it remains an open question to prove or disprove convergence of the true risk of SGD optimization methods to the optimal true risk value in the training of DNNs. In one of the main results of this work we reveal for a general class of activations, loss functions, random initializations, and SGD optimization methods (including, for example, standard SGD, momentum SGD, Nesterov accelerated SGD, Adagrad, RMSprop, Adadelta, Adam, Adamax, Nadam, Nadamax, and AMSGrad) that in the training of any arbitrary fully-connected feedforward DNN it does not hold that the true risk of the considered optimizer converges in probability to the optimal true risk value. Nonetheless, the true risk of the considered SGD optimization method may very well converge to a strictly suboptimal true risk value.
LGDec 2, 2024
An overview of diffusion models for generative artificial intelligenceDavide Gallon, Arnulf Jentzen, Philippe von Wurstemberger
This article provides a mathematically rigorous introduction to denoising diffusion probabilistic models (DDPMs), sometimes also referred to as diffusion probabilistic models or diffusion models, for generative artificial intelligence. We provide a detailed basic mathematical framework for DDPMs and explain the main ideas behind training and generation procedures. In this overview article we also review selected extensions and improvements of the basic framework from the literature such as improved DDPMs, denoising diffusion implicit models, classifier-free diffusion guidance models, and latent diffusion models.
LGMay 14, 2025
SAD Neural Networks: Divergent Gradient Flows and Asymptotic Optimality via o-minimal StructuresJulian Kranz, Davide Gallon, Steffen Dereich et al.
We study gradient flows for loss landscapes of fully connected feedforward neural networks with commonly used continuously differentiable activation functions such as the logistic, hyperbolic tangent, softplus or GELU function. We prove that the gradient flow either converges to a critical point or diverges to infinity while the loss converges to an asymptotic critical value. Moreover, we prove the existence of a threshold $\varepsilon>0$ such that the loss value of any gradient flow initialized at most $\varepsilon$ above the optimal level converges to it. For polynomial target functions and sufficiently big architecture and data set, we prove that the optimal loss value is zero and can only be realized asymptotically. From this setting, we deduce our main result that any gradient flow with sufficiently good initialization diverges to infinity. Our proof heavily relies on the geometry of o-minimal structures. We confirm these theoretical findings with numerical experiments and extend our investigation to more realistic scenarios, where we observe an analogous behavior.
OCJun 28, 2025
Deep neural networks can provably solve Bellman equations for Markov decision processes without the curse of dimensionalityArnulf Jentzen, Konrad Kleinberg, Thomas Kruse
Discrete time stochastic optimal control problems and Markov decision processes (MDPs) are fundamental models for sequential decision-making under uncertainty and as such provide the mathematical framework underlying reinforcement learning theory. A central tool for solving MDPs is the Bellman equation and its solution, the so-called $Q$-function. In this article, we construct deep neural network (DNN) approximations for $Q$-functions associated to MDPs with infinite time horizon and finite control set $A$. More specifically, we show that if the payoff function and the random transition dynamics of the MDP can be suitably approximated by DNNs with leaky rectified linear unit (ReLU) activation, then the solutions $Q_d\colon \mathbb R^d\to \mathbb R^{|A|}$, $d\in \mathbb{N}$, of the associated Bellman equations can also be approximated in the $L^2$-sense by DNNs with leaky ReLU activation whose numbers of parameters grow at most polynomially in both the dimension $d\in \mathbb{N}$ of the state space and the reciprocal $1/\varepsilon$ of the prescribed error $\varepsilon\in (0,1)$. Our proof relies on the recently introduced full-history recursive multilevel fixed-point (MLFP) approximation scheme.
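To make the notion of a $Q$-function concrete, the following sketch iterates a Bellman fixed-point map for a small MDP. It is not the MLFP scheme of the paper and makes several assumptions the abstract does not state: a finite state space, a discounted infinite-horizon payoff with discount factor `discount` in $[0,1)$, and the Bellman equation in the common form $Q(s,a) = r(s,a) + \text{discount}\cdot\mathbb{E}[\max_b Q(S',b)]$. All names are hypothetical.

```python
def q_value_iteration(rewards, transitions, discount, iterations=200):
    """Approximate the Q-function of a finite MDP by fixed-point iteration.

    rewards[s][a]: payoff for action a in state s.
    transitions[s][a]: dict mapping next state -> transition probability.
    discount: discount factor in [0, 1) (assumed setting).
    """
    q = [[0.0 for _ in row] for row in rewards]
    for _ in range(iterations):
        # Apply the Bellman map: immediate payoff plus discounted
        # expected value of acting optimally from the next state.
        q = [
            [
                rewards[s][a]
                + discount * sum(p * max(q[s2]) for s2, p in transitions[s][a].items())
                for a in range(len(rewards[s]))
            ]
            for s in range(len(rewards))
        ]
    return q
```

With a single state, two actions with payoffs 0 and 1, and discount 0.5, the iteration approaches $Q = (1, 2)$, since the optimal continuation value is $1/(1-0.5) = 2$ and is discounted by 0.5.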
OCApr 28, 2025
Sharp higher order convergence rates for the Adam optimizerSteffen Dereich, Arnulf Jentzen, Adrian Riekert
Gradient descent based optimization methods are the methods of choice to train deep neural networks in machine learning. Beyond the standard gradient descent method, also suitable modified variants of standard gradient descent involving acceleration techniques such as the momentum method and/or adaptivity techniques such as the RMSprop method are frequently considered optimization methods. These days the most popular of such sophisticated optimization schemes is presumably the Adam optimizer that was proposed in 2014 by Kingma and Ba. A highly relevant topic of research is to investigate the speed of convergence of such optimization methods. In particular, in 1964 Polyak showed that the standard gradient descent method converges in a neighborhood of a strict local minimizer with rate $(x - 1)(x + 1)^{-1}$ while momentum achieves the (optimal) strictly faster convergence rate $(\sqrt{x} - 1)(\sqrt{x} + 1)^{-1}$ where $x \in (1,\infty)$ is the condition number (the ratio of the largest and the smallest eigenvalue) of the Hessian of the objective function at the local minimizer. It is the key contribution of this work to reveal that Adam also converges with the strictly faster convergence rate $(\sqrt{x} - 1)(\sqrt{x} + 1)^{-1}$ while RMSprop only converges with the convergence rate $(x - 1)(x + 1)^{-1}$.
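The gap between the two rates quoted in this abstract is easiest to appreciate numerically. The sketch below simply evaluates both formulas for a given condition number; the function names are hypothetical.

```python
import math


def gradient_descent_rate(condition_number):
    """Rate (x - 1) / (x + 1) for a condition number x > 1."""
    return (condition_number - 1.0) / (condition_number + 1.0)


def momentum_rate(condition_number):
    """Rate (sqrt(x) - 1) / (sqrt(x) + 1) for a condition number x > 1."""
    root = math.sqrt(condition_number)
    return (root - 1.0) / (root + 1.0)
```

For a condition number of 100, the first rate is $99/101 \approx 0.980$ and the second is $9/11 \approx 0.818$, so the error contracts much faster per step under the square-root rate.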
LGJan 26, 2025
Mathematical analysis of the gradients in deep learningSteffen Dereich, Thang Do, Arnulf Jentzen et al.
Deep learning algorithms -- typically consisting of a class of deep artificial neural networks (ANNs) trained by a stochastic gradient descent (SGD) optimization method -- are nowadays an integral part in many areas of science, industry, and also our day to day life. Roughly speaking, in their most basic form, ANNs can be regarded as functions that consist of a series of compositions of affine-linear functions with multidimensional versions of so-called activation functions. One of the most popular of such activation functions is the rectified linear unit (ReLU) function $\mathbb{R} \ni x \mapsto \max\{ x, 0 \} \in \mathbb{R}$. The ReLU function is, however, not differentiable and, typically, this lack of regularity transfers to the cost function of the supervised learning problem under consideration. Regardless of this lack of differentiability issue, deep learning practitioners apply SGD methods based on suitably generalized gradients in standard deep learning libraries like TensorFlow or PyTorch. In this work we reveal an accurate and concise mathematical description of such generalized gradients in the training of deep fully-connected feedforward ANNs and we also study the resulting generalized gradient function analytically. Specifically, we provide an appropriate approximation procedure that uniquely describes the generalized gradient function, we prove that the generalized gradients are limiting Fréchet subgradients of the cost functional, and we conclude that the generalized gradients must coincide with the standard gradient of the cost functional on every open set on which the cost functional is continuously differentiable.
NAAug 23, 2025
Error analysis for the deep Kolmogorov methodIulian Cîmpean, Thang Do, Lukas Gonon et al.
The deep Kolmogorov method is a simple and popular deep learning based method for approximating solutions of partial differential equations (PDEs) of the Kolmogorov type. In this work we provide an error analysis for the deep Kolmogorov method for heat PDEs. Specifically, we reveal convergence with convergence rates for the overall mean square distance between the exact solution of the heat PDE and the realization function of the approximating deep neural network (DNN) associated with a stochastic optimization algorithm in terms of the size of the architecture (the depth/number of hidden layers and the width of the hidden layers) of the approximating DNN, in terms of the number of random sample points used in the loss function (the number of input-output data pairs used in the loss function), and in terms of the size of the optimization error made by the employed stochastic optimization method.
OCMay 28, 2025
PADAM: Parallel averaged Adam reduces the error for stochastic optimization in scientific machine learningArnulf Jentzen, Julian Kranz, Adrian Riekert
Averaging techniques such as Ruppert--Polyak averaging and exponential moving averaging (EMA) are powerful approaches to accelerate optimization procedures of stochastic gradient descent (SGD) optimization methods such as the popular ADAM optimizer. However, depending on the specific optimization problem under consideration, the type and the parameters for the averaging need to be adjusted to achieve the smallest optimization error. In this work we propose an averaging approach, which we refer to as parallel averaged ADAM (PADAM), in which we compute in parallel different averaged variants of ADAM and during the training process dynamically select the variant with the smallest optimization error. A central feature of this approach is that this procedure requires no more gradient evaluations than the usual ADAM optimizer as each of the averaged trajectories relies on the same underlying ADAM trajectory and thus on the same underlying gradients. We test the proposed PADAM optimizer in 13 stochastic optimization and deep neural network (DNN) learning problems and compare its performance with known optimizers from the literature such as standard SGD, momentum SGD, Adam with and without EMA, and ADAMW. In particular, we apply the compared optimizers to physics-informed neural network, deep Galerkin, deep backward stochastic differential equation and deep Kolmogorov approximations for boundary value partial differential equation problems from scientific machine learning, as well as to DNN approximations for optimal control and optimal stopping problems. In nearly all of the considered examples PADAM achieves, sometimes among others and sometimes exclusively, essentially the smallest optimization error. This work thus strongly suggests to consider PADAM for scientific machine learning problems and also motivates further research for adaptive averaging procedures within the training of DNNs.
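The two averaging techniques named in this abstract can be sketched in a few lines. The code below is only an illustration of the averages themselves, applied to a given list of scalar optimizer iterates; it does not implement PADAM's selection rule, and the function name, the scalar setting, and the choice to initialize the EMA at the first iterate are assumptions.

```python
def averaged_trajectories(iterates, ema_decay=0.9):
    """Return the running arithmetic mean (Ruppert-Polyak style) and the
    exponential moving average (EMA) of a sequence of scalar iterates."""
    if not iterates:
        return [], []
    running_mean, ema = [], []
    mean_value = 0.0
    ema_value = iterates[0]  # assumed initialization of the EMA
    for n, theta in enumerate(iterates, start=1):
        # Incremental form of the arithmetic mean of the first n iterates.
        mean_value += (theta - mean_value) / n
        # EMA: geometric down-weighting of older iterates.
        ema_value = ema_decay * ema_value + (1.0 - ema_decay) * theta
        running_mean.append(mean_value)
        ema.append(ema_value)
    return running_mean, ema
```

Both averages are computed from the same underlying trajectory, so maintaining several of them side by side costs no additional gradient evaluations, which is the feature the abstract highlights.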
OCJun 20, 2024
Learning rate adaptive stochastic gradient descent optimization methods: numerical simulations for deep learning methods for partial differential equations and convergence analysesSteffen Dereich, Arnulf Jentzen, Adrian Riekert
It is known that the standard stochastic gradient descent (SGD) optimization method, as well as accelerated and adaptive SGD optimization methods such as the Adam optimizer, fail to converge if the learning rates do not converge to zero (as, for example, in the situation of constant learning rates). Numerical simulations often use human-tuned deterministic learning rate schedules or small constant learning rates. The default learning rate schedules for SGD optimization methods in machine learning implementation frameworks such as TensorFlow and PyTorch are constant learning rates. In this work we propose and study a learning-rate-adaptive approach for SGD optimization methods in which the learning rate is adjusted based on empirical estimates for the values of the objective function of the considered optimization problem (the function that one intends to minimize). In particular, we propose a learning-rate-adaptive variant of the Adam optimizer and implement it in case of several neural network learning problems, particularly, in the context of deep learning approximation methods for partial differential equations such as deep Kolmogorov methods, physics-informed neural networks, and deep Ritz methods. In each of the presented learning problems the proposed learning-rate-adaptive variant of the Adam optimizer reduces the value of the objective function faster than the Adam optimizer with the default learning rate. For a simple class of quadratic minimization problems we also rigorously prove that a learning-rate-adaptive variant of the SGD optimization method converges to the minimizer of the considered minimization problem. Our convergence proof is based on an analysis of the laws of invariant measures of the SGD method as well as on a more general convergence analysis for SGD with random but predictable learning rates which we develop in this work.
LG Jun 16, 2024
Deep neural networks with ReLU, leaky ReLU, and softplus activation provably overcome the curse of dimensionality for space-time solutions of semilinear partial differential equations
Julia Ackermann, Arnulf Jentzen, Benno Kuckuck et al.
It is a challenging topic in applied mathematics to solve high-dimensional nonlinear partial differential equations (PDEs). Standard approximation methods for nonlinear PDEs suffer from the curse of dimensionality (COD) in the sense that the number of computational operations of the approximation method grows at least exponentially in the PDE dimension. With such methods it is essentially impossible to approximately solve high-dimensional PDEs even when the fastest currently available computers are used. However, in recent years great progress has been made in this area of research through suitable deep learning (DL) based methods for PDEs in which deep neural networks (DNNs) are used to approximate solutions of PDEs. Despite the remarkable success of such DL methods in simulations, it remains a fundamental open problem of research to prove (or disprove) that such methods can overcome the COD in the approximation of PDEs. However, there are nowadays several partial error analysis results for DL methods for high-dimensional nonlinear PDEs in the literature which prove that DNNs can overcome the COD in the sense that the number of parameters of the approximating DNN grows at most polynomially in both the reciprocal of the prescribed approximation accuracy $\varepsilon>0$ and the PDE dimension $d\in\mathbb{N}$. In the main result of this article we prove that for all $T,p\in(0,\infty)$ it holds that solutions $u_d\colon[0,T]\times\mathbb{R}^d\to\mathbb{R}$, $d\in\mathbb{N}$, of semilinear heat equations with Lipschitz continuous nonlinearities can be approximated in the $L^p$-sense on space-time regions without the COD by DNNs with the rectified linear unit (ReLU), the leaky ReLU, or the softplus activation function. In previous articles similar results have been established not for space-time regions but for the solutions $u_d(T,\cdot)$, $d\in\mathbb{N}$, at the terminal time $T$.
OC Dec 17, 2021
On the existence of global minima and convergence analyses for gradient descent methods in the training of deep neural networks
Arnulf Jentzen, Adrian Riekert
In this article we study fully-connected feedforward deep ReLU ANNs with an arbitrarily large number of hidden layers and we prove convergence of the risk of the GD optimization method with random initializations in the training of such ANNs under the assumption that the unnormalized probability density function of the probability distribution of the input data of the considered supervised learning problem is piecewise polynomial, under the assumption that the target function (describing the relationship between input data and the output data) is piecewise polynomial, and under the assumption that the risk function of the considered supervised learning problem admits at least one regular global minimum. In addition, in the special situation of shallow ANNs with just one hidden layer and one-dimensional input we also verify this assumption by proving in the training of such shallow ANNs that for every Lipschitz continuous target function there exists a global minimum in the risk landscape. Finally, in the training of deep ANNs with ReLU activation we also study solutions of gradient flow (GF) differential equations and we prove that every non-divergent GF trajectory converges with a polynomial rate of convergence to a critical point (in the sense of limiting Fréchet subdifferentiability). Our mathematical convergence analysis builds on ideas from our previous article Eberle et al., on tools from real algebraic geometry such as the concept of semi-algebraic functions and generalized Kurdyka-Łojasiewicz inequalities, on tools from functional analysis such as the Arzelà-Ascoli theorem, on tools from nonsmooth analysis such as the concept of limiting Fréchet subgradients, as well as on the fact, revealed by Petersen et al., that the set of realization functions of shallow ReLU ANNs with fixed architecture forms a closed subset of the set of continuous functions.
LG Dec 13, 2021
Convergence proof for stochastic gradient descent in the training of deep neural networks with ReLU activation for constant target functions
Martin Hutzenthaler, Arnulf Jentzen, Katharina Pohl et al.
In many numerical simulations stochastic gradient descent (SGD) type optimization methods perform very effectively in the training of deep neural networks (DNNs), but to this day it remains an open problem of research to provide a mathematical convergence analysis which rigorously explains the success of SGD type optimization methods in the training of DNNs. In this work we study SGD type optimization methods in the training of fully-connected feedforward DNNs with rectified linear unit (ReLU) activation. We first establish general regularity properties for the risk functions and their generalized gradient functions appearing in the training of such DNNs and, thereafter, we investigate the plain vanilla SGD optimization method in the training of such DNNs under the assumption that the target function under consideration is a constant function. Specifically, we prove under the assumption that the learning rates (the step sizes of the SGD optimization method) are sufficiently small but not $L^1$-summable and under the assumption that the target function is a constant function that the expectation of the risk of the considered SGD process converges in the training of such DNNs to zero as the number of SGD steps increases to infinity.
LG Aug 18, 2021
Existence, uniqueness, and convergence rates for gradient flows in the training of artificial neural networks with ReLU activation
Simon Eberle, Arnulf Jentzen, Adrian Riekert et al.
The training of artificial neural networks (ANNs) with rectified linear unit (ReLU) activation via gradient descent (GD) type optimization schemes is nowadays a common industrially relevant procedure. To this day there is in general no mathematical convergence analysis in the scientific literature which explains the numerical success of GD type optimization schemes in the training of ANNs with ReLU activation. GD type optimization schemes can be regarded as temporal discretization methods for the gradient flow (GF) differential equations associated with the considered optimization problem and, in view of this, it seems to be a natural direction of research to first aim to develop a mathematical convergence theory for time-continuous GF differential equations and, thereafter, to aim to extend such a time-continuous convergence theory to implementable time-discrete GD type optimization methods. In this article we establish two basic results for GF differential equations in the training of fully-connected feedforward ANNs with one hidden layer and ReLU activation. In the first main result of this article we establish in the training of such ANNs under the assumption that the probability distribution of the input data of the considered supervised learning problem is absolutely continuous with a bounded density function that every GF differential equation admits for every initial value a solution which is also unique among a suitable class of solutions. In the second main result of this article we prove in the training of such ANNs under the assumption that the target function and the density function of the probability distribution of the input data are piecewise polynomial that every non-divergent GF trajectory converges with an appropriate rate of convergence to a critical point and that the risk of the non-divergent GF trajectory converges with rate 1 to the risk of the critical point.
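The remark that GD schemes are temporal discretizations of gradient flows can be made concrete with a small sketch: one GD step $\theta_{n+1}=\theta_n-\gamma\nabla L(\theta_n)$ is an explicit Euler step of size $\gamma$ for $\theta'(t)=-\nabla L(\theta(t))$. The smooth toy objective $L(\theta)=\theta^2/2$ and the helper names below are assumptions chosen for illustration; the ReLU risk functions studied in the article are not smooth and are not modeled here.

```python
import math


def gradient_descent(theta0, step, n_steps):
    """Plain GD on L(theta) = theta^2 / 2 (toy objective, an assumption)."""
    theta = theta0
    for _ in range(n_steps):
        theta -= step * theta  # the gradient of theta^2 / 2 is theta
    return theta


def gradient_flow(theta0, t):
    """Exact solution of the gradient flow theta'(t) = -theta(t)."""
    return theta0 * math.exp(-t)


def euler_error(step, t=1.0, theta0=1.0):
    """Distance between GD after t / step steps and the flow at time t."""
    n_steps = round(t / step)
    return abs(gradient_descent(theta0, step, n_steps) - gradient_flow(theta0, t))
```

Comparing `euler_error(0.1)` with `euler_error(0.001)` shows the GD iterates approaching the gradient flow trajectory as the step size decreases, which is the sense in which a time-continuous theory can guide the analysis of the time-discrete method.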
OC Aug 10, 2021
A proof of convergence for the gradient descent optimization method with random initializations in the training of neural networks with ReLU activation for piecewise linear target functions
Arnulf Jentzen, Adrian Riekert
Gradient descent (GD) type optimization methods are the standard instrument to train artificial neural networks (ANNs) with rectified linear unit (ReLU) activation. Despite the great success of GD type optimization methods in numerical simulations for the training of ANNs with ReLU activation, it remains, even in the simplest situation of the plain vanilla GD optimization method with random initializations and ANNs with one hidden layer, an open problem to prove (or disprove) the conjecture that the risk of the GD optimization method converges in the training of such ANNs to zero as the width of the ANNs, the number of independent random initializations, and the number of GD steps increase to infinity. In this article we prove this conjecture in the situation where the probability distribution of the input data is equivalent to the continuous uniform distribution on a compact interval, where the probability distributions for the random initializations of the ANN parameters are standard normal distributions, and where the target function under consideration is continuous and piecewise affine linear. Roughly speaking, the key ingredients in our mathematical convergence analysis are (i) to prove that suitable sets of global minima of the risk functions are twice continuously differentiable submanifolds of the ANN parameter spaces, (ii) to prove that the Hessians of the risk functions on these sets of global minima satisfy an appropriate maximal rank condition, and, thereafter, (iii) to apply the machinery in [Fehrman, B., Gess, B., Jentzen, A., Convergence rates for the stochastic gradient descent method for non-convex objective functions. J. Mach. Learn. Res. 21(136): 1-48, 2020] to establish convergence of the GD optimization method with random initializations.