LGMar 19
Uniform a priori bounds and error analysis for the Adam stochastic gradient descent optimization methodSteffen Dereich, Thang Do, Arnulf Jentzen
The adaptive moment estimation (Adam) optimizer proposed by Kingma & Ba (2014) is presumably the most popular stochastic gradient descent (SGD) optimization method for the training of deep neural networks (DNNs) in artificial intelligence (AI) systems. Despite its groundbreaking success in the training of AI systems, it still remains an open research problem to provide a complete error analysis of Adam, not only for optimizing DNNs but even when applied to strongly convex stochastic optimization problems (SOPs). Previous error analysis results for strongly convex SOPs in the literature provide conditional convergence analyses that rely on the assumption that Adam does not diverge to infinity but remains uniformly bounded. It is the key contribution of this work to establish uniform a priori bounds for Adam and, thereby, to provide -- for the first time -- an unconditional error analysis for Adam for a large class of strongly convex SOPs.
OCNov 10, 2025
Adam symmetry theorem: characterization of the convergence of the stochastic Adam optimizerSteffen Dereich, Thang Do, Arnulf Jentzen et al.
Beside the standard stochastic gradient descent (SGD) method, the Adam optimizer due to Kingma & Ba (2014) is currently probably the best-known optimization method for the training of deep neural networks in artificial intelligence (AI) systems. Despite the popularity and the success of Adam it remains an \emph{open research problem} to provide a rigorous convergence analysis for Adam even for the class of strongly convex SOPs. In one of the main results of this work we establish convergence rates for Adam in terms of the number of gradient steps (convergence rate \nicefrac{1}{2} w.r.t. the size of the learning rate), the size of the mini-batches (convergence rate 1 w.r.t. the size of the mini-batches), and the size of the second moment parameter of Adam (convergence rate 1 w.r.t. the distance of the second moment parameter to 1) for the class of strongly convex SOPs. In a further main result of this work, which we refer to as \emph{Adam symmetry theorem}, we illustrate the optimality of the established convergence rates by proving for a special class of simple quadratic strongly convex SOPs that Adam converges as the number of gradient steps increases to infinity to the solution of the SOP (the unique minimizer of the strongly convex objective function) if and \emph{only} if the random variables in the SOP (the data in the SOP) are \emph{symmetrically distributed}. In particular, in the standard case where the random variables in the SOP are not symmetrically distributed we \emph{disprove} that Adam converges to the minimizer of the SOP as the number of Adam steps increases to infinity. We also complement the conclusions of our convergence analysis and the Adam symmetry theorem by several numerical simulations that indicate the sharpness of the established convergence rates and that illustrate the practical appearance of the phenomena revealed in the \emph{Adam symmetry theorem}.
LGOct 14, 2024
Non-convergence to global minimizers in data driven supervised deep learning: Adam and stochastic gradient descent optimization provably fail to converge to global minimizers in the training of deep neural networks with ReLU activationThang Do, Sonja Hannibal, Arnulf Jentzen
Deep learning methods - consisting of a class of deep neural networks (DNNs) trained by a stochastic gradient descent (SGD) optimization method - are nowadays key tools to solve data driven supervised learning problems. Despite the great success of SGD methods in the training of DNNs, it remains a fundamental open problem of research to explain the success and the limitations of such methods in rigorous theoretical terms. In particular, even in the standard setup of data driven supervised learning problems, it remained an open research problem to prove (or disprove) that SGD methods converge in the training of DNNs with the popular rectified linear unit (ReLU) activation function with high probability to global minimizers in the optimization landscape. In this work we answer this question negatively. Specifically, in this work we prove for a large class of SGD methods that the considered optimizer does with high probability not converge to global minimizers of the optimization problem. It turns out that the probability to not converge to a global minimizer converges at least exponentially quickly to one as the width of the first hidden layer of the ANN and the depth of the ANN, respectively, increase. The general non-convergence results of this work do not only apply to the plain vanilla standard SGD method but also to a large class of accelerated and adaptive SGD methods such as the momentum SGD, the Nesterov accelerated SGD, the Adagrad, the RMSProp, the Adam, the Adamax, the AMSGrad, and the Nadam optimizers.
LGMar 3, 2025
Non-convergence to the optimal risk for Adam and stochastic gradient descent optimization in the training of deep neural networksThang Do, Arnulf Jentzen, Adrian Riekert
Despite the omnipresent use of stochastic gradient descent (SGD) optimization methods in the training of deep neural networks (DNNs), it remains, in basically all practically relevant scenarios, a fundamental open problem to provide a rigorous theoretical explanation for the success (and the limitations) of SGD optimization methods in deep learning. In particular, it remains an open question to prove or disprove convergence of the true risk of SGD optimization methods to the optimal true risk value in the training of DNNs. In one of the main results of this work we reveal for a general class of activations, loss functions, random initializations, and SGD optimization methods (including, for example, standard SGD, momentum SGD, Nesterov accelerated SGD, Adagrad, RMSprop, Adadelta, Adam, Adamax, Nadam, Nadamax, and AMSGrad) that in the training of any arbitrary fully-connected feedforward DNN it does not hold that the true risk of the considered optimizer converges in probability to the optimal true risk value. Nonetheless, the true risk of the considered SGD optimization method may very well converge to a strictly suboptimal true risk value.
LGJan 26, 2025
Mathematical analysis of the gradients in deep learningSteffen Dereich, Thang Do, Arnulf Jentzen et al.
Deep learning algorithms -- typically consisting of a class of deep artificial neural networks (ANNs) trained by a stochastic gradient descent (SGD) optimization method -- are nowadays an integral part in many areas of science, industry, and also our day to day life. Roughly speaking, in their most basic form, ANNs can be regarded as functions that consist of a series of compositions of affine-linear functions with multidimensional versions of so-called activation functions. One of the most popular of such activation functions is the rectified linear unit (ReLU) function $\mathbb{R} \ni x \mapsto \max\{ x, 0 \} \in \mathbb{R}$. The ReLU function is, however, not differentiable and, typically, this lack of regularity transfers to the cost function of the supervised learning problem under consideration. Regardless of this lack of differentiability issue, deep learning practioners apply SGD methods based on suitably generalized gradients in standard deep learning libraries like {\sc TensorFlow} or {\sc Pytorch}. In this work we reveal an accurate and concise mathematical description of such generalized gradients in the training of deep fully-connected feedforward ANNs and we also study the resulting generalized gradient function analytically. Specifically, we provide an appropriate approximation procedure that uniquely describes the generalized gradient function, we prove that the generalized gradients are limiting Fréchet subgradients of the cost functional, and we conclude that the generalized gradients must coincide with the standard gradient of the cost functional on every open sets on which the cost functional is continuously differentiable.
NAAug 23, 2025
Error analysis for the deep Kolmogorov methodIulian Cîmpean, Thang Do, Lukas Gonon et al.
The deep Kolmogorov method is a simple and popular deep learning based method for approximating solutions of partial differential equations (PDEs) of the Kolmogorov type. In this work we provide an error analysis for the deep Kolmogorov method for heat PDEs. Specifically, we reveal convergence with convergence rates for the overall mean square distance between the exact solution of the heat PDE and the realization function of the approximating deep neural network (DNN) associated with a stochastic optimization algorithm in terms of the size of the architecture (the depth/number of hidden layers and the width of the hidden layers) of the approximating DNN, in terms of the number of random sample points used in the loss function (the number of input-output data pairs used in the loss function), and in terms of the size of the optimization error made by the employed stochastic optimization method.