LGApr 4, 2023Code
FakET: Simulating Cryo-Electron Tomograms with Neural Style TransferPavol Harar, Lukas Herrmann, Philipp Grohs et al.
In cryo-electron microscopy, accurate particle localization and classification are imperative. Recent deep learning solutions, though successful, require extensive training data sets. The protracted generation time of physics-based models, often employed to produce these data sets, limits their broad applicability. We introduce FakET, a method based on Neural Style Transfer, capable of simulating the forward operator of any cryo transmission electron microscope. It can be used to adapt a synthetic training data set according to reference data producing high-quality simulated micrographs or tilt-series. To assess the quality of our generated data, we used it to train a state-of-the-art localization and classification architecture and compared its performance with a counterpart trained on benchmark data. Remarkably, our technique matches the performance, boosts data generation speed 750 times, uses 33 times less memory, and scales well to typical transmission electron microscope detector sizes. It leverages GPU acceleration and parallel processing. The source code is available at https://github.com/paloha/faket.
LGMay 19, 2022
Gold-standard solutions to the Schrödinger equation using deep learning: How much physics do we need?Leon Gerard, Michael Scherbela, Philipp Marquetand et al.
Finding accurate solutions to the Schrödinger equation is the key unsolved challenge of computational chemistry. Given its importance for the development of new chemical compounds, decades of research have been dedicated to this problem, but due to the large dimensionality even the best available methods do not yet reach the desired accuracy. Recently the combination of deep learning with Monte Carlo methods has emerged as a promising way to obtain highly accurate energies and moderate scaling of computational cost. In this paper we significantly contribute towards this goal by introducing a novel deep-learning architecture that achieves 40-70% lower energy error at 6x lower computational cost compared to previous approaches. Using our method we establish a new benchmark by calculating the most accurate variational ground state energies ever published for a number of different atoms and molecules. We systematically break down and measure our improvements, focusing in particular on the effect of increasing physical prior knowledge. We surprisingly find that increasing the prior knowledge given to the architecture can actually decrease accuracy.
NAMar 17, 2018
Projection-Based Finite Elements for Nonlinear Function SpacesPhilipp Grohs, Hanne Hardering, Oliver Sander et al.
We introduce a novel type of approximation spaces for functions with values in a nonlinear manifold. The discrete functions are constructed by piecewise polynomial interpolation in a Euclidean embedding space, and then projecting pointwise onto the manifold. We show optimal interpolation error bounds with respect to Lebesgue and Sobolev norms. Additionally, we show similar bounds for the test functions, i.e., variations of discrete functions. Combining these results with a nonlinear Céa lemma, we prove optimal $L^2$ and $H^1$ discretization error bounds for harmonic maps from a planar domain into a smooth manifold. All these error bounds are also verified numerically.
COMP-PHMar 17, 2023
Towards a Foundation Model for Neural Network WavefunctionsMichael Scherbela, Leon Gerard, Philipp Grohs
Deep neural networks have become a highly accurate and powerful wavefunction ansatz in combination with variational Monte Carlo methods for solving the electronic Schrödinger equation. However, despite their success and favorable scaling, these methods are still computationally too costly for wide adoption. A significant obstacle is the requirement to optimize the wavefunction from scratch for each new system, thus requiring long optimization. In this work, we propose a novel neural network ansatz, which effectively maps uncorrelated, computationally cheap Hartree-Fock orbitals, to correlated, high-accuracy neural network orbitals. This ansatz is inherently capable of learning a single wavefunction across multiple compounds and geometries, as we demonstrate by successfully transferring a wavefunction model pre-trained on smaller fragments to larger compounds. Furthermore, we provide ample experimental evidence to support the idea that extensive pre-training of a such a generalized wavefunction model across different compounds and geometries could lead to a foundation wavefunction model. Such a model could yield high-accuracy ab-initio energies using only minimal computational effort for fine-tuning and evaluation of observables.
CHEM-PHJul 15, 2023
Variational Monte Carlo on a Budget -- Fine-tuning pre-trained Neural WavefunctionsMichael Scherbela, Leon Gerard, Philipp Grohs
Obtaining accurate solutions to the Schrödinger equation is the key challenge in computational quantum chemistry. Deep-learning-based Variational Monte Carlo (DL-VMC) has recently outperformed conventional approaches in terms of accuracy, but only at large computational cost. Whereas in many domains models are trained once and subsequently applied for inference, accurate DL-VMC so far requires a full optimization for every new problem instance, consuming thousands of GPUhs even for small molecules. We instead propose a DL-VMC model which has been pre-trained using self-supervised wavefunction optimization on a large and chemically diverse set of molecules. Applying this model to new molecules without any optimization, yields wavefunctions and absolute energies that outperform established methods such as CCSD(T)-2Z. To obtain accurate relative energies, only few fine-tuning steps of this base model are required. We accomplish this with a fully end-to-end machine-learned model, consisting of an improved geometry embedding architecture and an existing SE(3)-equivariant model to represent molecular orbitals. Combining this architecture with continuous sampling of geometries, we improve zero-shot accuracy by two orders of magnitude compared to the state of the art. We extensively evaluate the accuracy, scalability and limitations of our base model on a wide variety of test systems.
LGMay 26, 2022
Learning ReLU networks to high uniform accuracy is intractableJulius Berner, Philipp Grohs, Felix Voigtlaender
Statistical learning theory provides bounds on the necessary number of training samples needed to reach a prescribed accuracy in a learning problem formulated over a given target class. This accuracy is typically measured in terms of a generalization error, that is, an expected value of a given loss function. However, for several applications -- for example in a security-critical context or for problems in the computational sciences -- accuracy in this sense is not sufficient. In such cases, one would like to have guarantees for high accuracy on every input value, that is, with respect to the uniform norm. In this paper we precisely quantify the number of training samples needed for any conceivable training algorithm to guarantee a given uniform accuracy on any learning problem formulated over target classes containing (or consisting of) ReLU neural networks of a prescribed architecture. We prove that, under very general assumptions, the minimal number of training samples for this task scales exponentially both in the depth and the input dimension of the network architecture.
NAJan 21, 2016
On the Approximation of Functions with Line Singularities by RidgeletsAxel Obermeier, Philipp Grohs
In [GO15], the authors discussed the existence of numerically feasible solvers for advection equations that run in optimal computational complexity. In this paper, we complete the last remaining requirement to achieve this goal - by showing that ridgelets, on which the solver is based, approximate functions with line singularities (which may appear as solutions to the advection equation) with the best possible approximation rate. Structurally, the proof resembles [Can01], where a similar result was proved for a different ridgelet construction, which is however not well-suited for use in a PDE solver (and in particular, not suitable for the CDD-schemes [CDD01] we are interested in). Due to the differences between the two ridgelet constructions, we have to deal with quite a different set of issues, but are also able to relax the (support) conditions on the function being approximated. Finally, the proof employs a new convolution-type estimate that could be of independent interest due to its sharpness.
OCMay 4
Robust and Fast Training via Per-Sample ClippingDavide Nobile, Philipp Grohs
We propose a robust gradient estimator based on per-sample gradient clipping and analyze its properties both theoretically and empirically. We show that the resulting method, per-sample clipped SGD (PS-Clip-SGD), achieves optimal in-expectation convergence rates for non-convex optimization problems under heavy-tailed gradient noise. Moreover, we establish high-probability convergence guarantees that match the in-expectation rates up to polylogarithmic factors in the failure probability. We complement our theoretical results with multiple numerical experiments. In particular, we demonstrate that PS-Clip-SGD outperforms both vanilla SGD with momentum and standard gradient clipping when training AlexNet on the CIFAR-100 dataset, even after accounting for the additional computational time caused by per-sample clipping. We also empirically show that, in the presence of gradient accumulation, applying clipping at the mini-batch level can improve training performance while incurring virtually no additional computational cost. This finding is particularly interesting, as it contradicts the common practice of applying clipping only after all accumulation steps have been completed.
COMP-PHApr 8, 2025
Accurate Ab-initio Neural-network Solutions to Large-Scale Electronic Structure ProblemsMichael Scherbela, Nicholas Gao, Philipp Grohs et al.
We present finite-range embeddings (FiRE), a novel wave function ansatz for accurate large-scale ab-initio electronic structure calculations. Compared to contemporary neural-network wave functions, FiRE reduces the asymptotic complexity of neural-network variational Monte Carlo (NN-VMC) by $\sim n_\text{el}$, the number of electrons. By restricting electron-electron interactions within the neural network, FiRE accelerates all key operations -- sampling, pseudopotentials, and Laplacian computations -- resulting in a real-world $10\times$ acceleration in now-feasible 180-electron calculations. We validate our method's accuracy on various challenging systems, including biochemical compounds, conjugated hydrocarbons, and organometallic compounds. On these systems, FiRE's energies are consistently within chemical accuracy of the most reliable data, including experiments, even in cases where high-accuracy methods such as CCSD(T), AFQMC, or contemporary NN-VMC fall short. With these improvements in both runtime and accuracy, FiRE represents a new `gold-standard' method for fast and accurate large-scale ab-initio calculations, potentially enabling new computational studies in fields like quantum chemistry, solid-state physics, and material design.
COMP-PHMay 13, 2024
Transferable Neural Wavefunctions for SolidsLeon Gerard, Michael Scherbela, Halvard Sutterud et al.
Deep-Learning-based Variational Monte Carlo (DL-VMC) has recently emerged as a highly accurate approach for finding approximate solutions to the many-electron Schrödinger equation. Despite its favorable scaling with the number of electrons, $\mathcal{O}(n_\text{el}^{4})$, the practical value of DL-VMC is limited by the high cost of optimizing the neural network weights for every system studied. To mitigate this problem, recent research has proposed optimizing a single neural network across multiple systems, reducing the cost per system. Here we extend this approach to solids, where similar but distinct calculations using different geometries, boundary conditions, and supercell sizes are often required. We show how to optimize a single ansatz across all of these variations, reducing the required number of optimization steps by an order of magnitude. Furthermore, we exploit the transfer capabilities of a pre-trained network. We successfully transfer a network, pre-trained on 2x2x2 supercells of LiH, to 3x3x3 supercells. This reduces the number of optimization steps required to simulate the large system by a factor of 50 compared to previous work.
LGMar 23, 2025
Theory-to-Practice Gap for Neural Networks and Neural OperatorsPhilipp Grohs, Samuel Lanthaler, Margaret Trautner
This work studies the sampling complexity of learning with ReLU neural networks and neural operators. For mappings belonging to relevant approximation spaces, we derive upper bounds on the best-possible convergence rate of any learning algorithm, with respect to the number of samples. In the finite-dimensional case, these bounds imply a gap between the parametric and sampling complexities of learning, known as the \emph{theory-to-practice gap}. In this work, a unified treatment of the theory-to-practice gap is achieved in a general $L^p$-setting, while at the same time improving available bounds in the literature. Furthermore, based on these results the theory-to-practice gap is extended to the infinite-dimensional setting of operator learning. Our results apply to Deep Operator Networks and integral kernel-based neural operators, including the Fourier neural operator. We show that the best-possible convergence rate in a Bochner $L^p$-norm is bounded by Monte-Carlo rates of order $1/p$.
LGDec 20, 2023
Sampling Complexity of Deep Approximation SpacesAhmed Abdeljawad, Philipp Grohs
While it is well-known that neural networks enjoy excellent approximation capabilities, it remains a big challenge to compute such approximations from point samples. Based on tools from Information-based complexity, recent work by Grohs and Voigtlaender [Journal of the FoCM (2023)] developed a rigorous framework for assessing this so-called "theory-to-practice gap". More precisely, in that work it is shown that there exist functions that can be approximated by neural networks with ReLU activation function at an arbitrary rate while requiring an exponentially growing (in the input dimension) number of samples for their numerical computation. The present study extends these findings by showing analogous results for the ReQU activation function.
MLNov 8, 2024
The sampling complexity of learning invertible residual neural networksYuanyuan Li, Philipp Grohs, Philipp Petersen
In recent work it has been shown that determining a feedforward ReLU neural network to within high uniform accuracy from point samples suffers from the curse of dimensionality in terms of the number of samples needed. As a consequence, feedforward ReLU neural networks are of limited use for applications where guaranteed high uniform accuracy is required. We consider the question of whether the sampling complexity can be improved by restricting the specific neural network architecture. To this end, we investigate invertible residual neural networks which are foundational architectures in deep learning and are widely employed in models that power modern generative methods. Our main result shows that the residual neural network architecture and invertibility do not help overcome the complexity barriers encountered with simpler feedforward architectures. Specifically, we demonstrate that the computational complexity of approximating invertible residual neural networks from point samples in the uniform norm suffers from the curse of dimensionality. Similar results are established for invertible convolutional Residual neural networks.
NEDec 20, 2021
Integral representations of shallow neural network with Rectified Power Unit activation functionAhmed Abdeljawad, Philipp Grohs
In this effort, we derive a formula for the integral representation of a shallow neural network with the Rectified Power Unit activation function. Mainly, our first result deals with the univariate case of representation capability of RePU shallow networks. The multidimensional result in this paper characterizes the set of functions that can be represented with bounded norm and possibly unbounded width.
FAOct 28, 2021
Sobolev-type embeddings for neural network approximation spacesPhilipp Grohs, Felix Voigtlaender
We consider neural network approximation spaces that classify functions according to the rate at which they can be approximated (with error measured in $L^p$) by ReLU neural networks with an increasing number of coefficients, subject to bounds on the magnitude of the coefficients and the number of hidden layers. We prove embedding theorems between these spaces for different values of $p$. Furthermore, we derive sharp embeddings of these approximation spaces into Hölder spaces. We find that, analogous to the case of classical function spaces (such as Sobolev spaces, or Besov spaces) it is possible to trade "smoothness" (i.e., approximation rate) for increased integrability. Combined with our earlier results in [arXiv:2104.02746], our embedding theorems imply a somewhat surprising fact related to "learning" functions from a given neural network space based on point samples: if accuracy is measured with respect to the uniform norm, then an optimal "learning" algorithm for reconstructing functions that are well approximable by ReLU neural networks is simply given by piecewise constant interpolation on a tensor product grid.
COMP-PHMay 18, 2021
Solving the electronic Schrödinger equation for multiple nuclear geometries with weight-sharing deep neural networksMichael Scherbela, Rafael Reisenhofer, Leon Gerard et al.
Accurate numerical solutions for the Schrödinger equation are of utmost importance in quantum chemistry. However, the computational cost of current high-accuracy methods scales poorly with the number of interacting particles. Combining Monte Carlo methods with unsupervised training of neural networks has recently been proposed as a promising approach to overcome the curse of dimensionality in this setting and to obtain accurate wavefunctions for individual molecules at a moderately scaling computational cost. These methods currently do not exploit the regularity exhibited by wavefunctions with respect to their molecular geometries. Inspired by recent successful applications of deep transfer learning in machine translation and computer vision tasks, we attempt to leverage this regularity by introducing a weight-sharing constraint when optimizing neural network-based models for different molecular geometries. That is, we restrict the optimization process such that up to 95 percent of weights in a neural network model are in fact equal across varying molecular geometries. We find that this technique can accelerate optimization when considering sets of nuclear geometries of the same molecule by an order of magnitude and that it opens a promising route towards pre-trained neural network wavefunctions that yield high accuracy even across different molecules.
LGMay 9, 2021
The Modern Mathematics of Deep LearningJulius Berner, Philipp Grohs, Gitta Kutyniok et al.
We describe the new field of mathematical analysis of deep learning. This field emerged around a list of research questions that were not answered within the classical framework of learning theory. These questions concern: the outstanding generalization power of overparametrized neural networks, the role of depth in deep architectures, the apparent absence of the curse of dimensionality, the surprisingly successful optimization performance despite the non-convexity of the problem, understanding what features are learned, why deep architectures perform exceptionally well in physical problems, and which fine aspects of an architecture affect the behavior of a learning task in which way. We present an overview of modern approaches that yield partial answers to these questions. For selected approaches, we describe the main ideas in more detail.
LGApr 6, 2021
Proof of the Theory-to-Practice Gap in Deep Learning via Sampling Complexity bounds for Neural Network Approximation SpacesPhilipp Grohs, Felix Voigtlaender
We study the computational complexity of (deterministic or randomized) algorithms based on point samples for approximating or integrating functions that can be well approximated by neural networks. Such algorithms (most prominently stochastic gradient descent and its variants) are used extensively in the field of deep learning. One of the most important problems in this field concerns the question of whether it is possible to realize theoretically provable neural network approximation rates by such algorithms. We answer this question in the negative by proving hardness results for the problems of approximation and integration on a novel class of neural network approximation spaces. In particular, our results confirm a conjectured and empirically observed theory-to-practice gap in deep learning. We complement our hardness results by showing that approximation rates of a comparable order of convergence are (at least theoretically) achievable.
NAMar 9, 2021
Deep neural network approximation for high-dimensional parabolic Hamilton-Jacobi-Bellman equationsPhilipp Grohs, Lukas Herrmann
The approximation of solutions to second order Hamilton--Jacobi--Bellman (HJB) equations by deep neural networks is investigated. It is shown that for HJB equations that arise in the context of the optimal control of certain Markov processes the solution can be approximated by deep neural networks without incurring the curse of dimension. The dynamics is assumed to depend affinely on the controls and the cost depends quadratically on the controls. The admissible controls take values in a bounded set.
LGDec 23, 2020
Approximations with deep neural networks in Sobolev time-spaceAhmed Abdeljawad, Philipp Grohs
Solutions of evolution equation generally lies in certain Bochner-Sobolev spaces, in which the solution may has regularity and integrability properties for the time variable that can be different for the space variables. Therefore, in this paper, we develop a framework shows that deep neural networks can approximate Sobolev-regular functions with respect to Bochner-Sobolev spaces. In our work we use the so-called Rectified Cubic Unit (ReCU) as an activation function in our networks, which allows us to deduce approximation results of the neural networks while avoiding issues caused by the non regularity of the most commonly used Rectivied Linear Unit (ReLU) activation function.
LGNov 9, 2020
Numerically Solving Parametric Families of High-Dimensional Kolmogorov Partial Differential Equations via Deep LearningJulius Berner, Markus Dablander, Philipp Grohs
We present a deep learning algorithm for the numerical solution of parametric families of high-dimensional linear Kolmogorov partial differential equations (PDEs). Our method is based on reformulating the numerical approximation of a whole family of Kolmogorov PDEs as a single statistical learning problem using the Feynman-Kac formula. Successful numerical experiments are presented, which empirically confirm the functionality and efficiency of our proposed algorithm in the case of heat equations and Black-Scholes option pricing models parametrized by affine-linear coefficient functions. We show that a single deep neural network trained on simulated data is capable of learning the solution functions of an entire family of PDEs on a full space-time region. Most notably, our numerical observations and theoretical results also demonstrate that the proposed method does not suffer from the curse of dimensionality, distinguishing it from almost all standard numerical methods for PDEs.
FAAug 3, 2020
Phase Transitions in Rate Distortion Theory and Deep LearningPhilipp Grohs, Andreas Klotz, Felix Voigtlaender
Rate distortion theory is concerned with optimally encoding a given signal class $\mathcal{S}$ using a budget of $R$ bits, as $R\to\infty$. We say that $\mathcal{S}$ can be compressed at rate $s$ if we can achieve an error of $\mathcal{O}(R^{-s})$ for encoding $\mathcal{S}$; the supremal compression rate is denoted $s^\ast(\mathcal{S})$. Given a fixed coding scheme, there usually are elements of $\mathcal{S}$ that are compressed at a higher rate than $s^\ast(\mathcal{S})$ by the given coding scheme; we study the size of this set of signals. We show that for certain "nice" signal classes $\mathcal{S}$, a phase transition occurs: We construct a probability measure $\mathbb{P}$ on $\mathcal{S}$ such that for every coding scheme $\mathcal{C}$ and any $s >s^\ast(\mathcal{S})$, the set of signals encoded with error $\mathcal{O}(R^{-s})$ by $\mathcal{C}$ forms a $\mathbb{P}$-null-set. In particular our results apply to balls in Besov and Sobolev spaces that embed compactly into $L^2(Ω)$ for a bounded Lipschitz domain $Ω$. As an application, we show that several existing sharpness results concerning function approximation using deep neural networks are generically sharp. We also provide quantitative and non-asymptotic bounds on the probability that a random $f\in\mathcal{S}$ can be encoded to within accuracy $\varepsilon$ using $R$ bits. This result is applied to the problem of approximately representing $f\in\mathcal{S}$ to within accuracy $\varepsilon$ by a (quantized) neural network that is constrained to have at most $W$ nonzero weights and is generated by an arbitrary "learning" procedure. We show that for any $s >s^\ast(\mathcal{S})$ there are constants $c,C$ such that, no matter how we choose the "learning" procedure, the probability of success is bounded from above by $\min\big\{1,2^{C\cdot W\lceil\log_2(1+W)\rceil^2 -c\cdot\varepsilon^{-1/s}}\big\}$.
NAJul 10, 2020
Deep neural network approximation for high-dimensional elliptic PDEs with boundary conditionsPhilipp Grohs, Lukas Herrmann
In recent work it has been established that deep neural networks are capable of approximating solutions to a large class of parabolic partial differential equations without incurring the curse of dimension. However, all this work has been restricted to problems formulated on the whole Euclidean domain. On the other hand, most problems in engineering and the sciences are formulated on finite domains and subjected to boundary conditions. The present paper considers an important such model problem, namely the Poisson equation on a domain $D\subset \mathbb{R}^d$ subject to Dirichlet boundary conditions. It is shown that deep neural networks are capable of representing solutions of that problem without incurring the curse of dimension. The proofs are based on a probabilistic representation of the solution to the Poisson equation as well as a suitable sampling method.
NANov 20, 2019
Uniform error estimates for artificial neural network approximations for heat equationsLukas Gonon, Philipp Grohs, Arnulf Jentzen et al.
Recently, artificial neural networks (ANNs) in conjunction with stochastic gradient descent optimization methods have been employed to approximately compute solutions of possibly rather high-dimensional partial differential equations (PDEs). Very recently, there have also been a number of rigorous mathematical results in the scientific literature which examine the approximation capabilities of such deep learning based approximation algorithms for PDEs. These mathematical results from the scientific literature prove in part that algorithms based on ANNs are capable of overcoming the curse of dimensionality in the numerical approximation of high-dimensional PDEs. In these mathematical results from the scientific literature usually the error between the solution of the PDE and the approximating ANN is measured in the $L^p$-sense with respect to some $p \in [1,\infty)$ and some probability measure. In many applications it is, however, also important to control the error in a uniform $L^\infty$-sense. The key contribution of the main result of this article is to develop the techniques to obtain error estimates between solutions of PDEs and approximating ANNs in the uniform $L^\infty$-sense. In particular, we prove that the number of parameters of an ANN to uniformly approximate the classical solution of the heat equation in a region $ [a,b]^d $ for a fixed time point $ T \in (0,\infty) $ grows at most polynomially in the dimension $ d \in \mathbb{N} $ and the reciprocal of the approximation precision $ \varepsilon > 0 $. This shows that ANNs can overcome the curse of dimensionality in the numerical approximation of the heat equation when the error is measured in the uniform $L^\infty$-norm.
NAAug 28, 2019
Deep neural network approximations for Monte Carlo algorithmsPhilipp Grohs, Arnulf Jentzen, Diyora Salimova
Recently, it has been proposed in the literature to employ deep neural networks (DNNs) together with stochastic gradient descent methods to approximate solutions of PDEs. There are also a few results in the literature which prove that DNNs can approximate solutions of certain PDEs without the curse of dimensionality in the sense that the number of real parameters used to describe the DNN grows at most polynomially both in the PDE dimension and the reciprocal of the prescribed approximation accuracy. One key argument in most of these results is, first, to use a Monte Carlo approximation scheme which can approximate the solution of the PDE under consideration at a fixed space-time point without the curse of dimensionality and, thereafter, to prove that DNNs are flexible enough to mimic the behaviour of the used approximation scheme. Having this in mind, one could aim for a general abstract result which shows under suitable assumptions that if a certain function can be approximated by any kind of (Monte Carlo) approximation scheme without the curse of dimensionality, then this function can also be approximated with DNNs without the curse of dimensionality. It is a key contribution of this article to make a first step towards this direction. In particular, the main result of this paper, essentially, shows that if a function can be approximated by means of some suitable discrete approximation scheme without the curse of dimensionality and if there exist DNNs which satisfy certain regularity properties and which approximate this discrete approximation scheme without the curse of dimensionality, then the function itself can also be approximated with DNNs without the curse of dimensionality. As an application of this result we establish that solutions of suitable Kolmogorov PDEs can be approximated with DNNs without the curse of dimensionality.
NAAug 11, 2019
Space-time error estimates for deep neural network approximations for differential equationsPhilipp Grohs, Fabian Hornung, Arnulf Jentzen et al.
Over the last few years deep artificial neural networks (DNNs) have very successfully been used in numerical simulations for a wide variety of computational problems including computer vision, image classification, speech recognition, natural language processing, as well as computational advertisement. In addition, it has recently been proposed to approximate solutions of partial differential equations (PDEs) by means of stochastic learning problems involving DNNs. There are now also a few rigorous mathematical results in the scientific literature which provide error estimates for such deep learning based approximation methods for PDEs. All of these articles provide spatial error estimates for neural network approximations for PDEs but do not provide error estimates for the entire space-time error for the considered neural network approximations. It is the subject of the main result of this article to provide space-time error estimates for DNN approximations of Euler approximations of certain perturbed differential equations. Our proof of this result is based (i) on a certain artificial neural network (ANN) calculus and (ii) on ANN approximation results for products of the form $[0,T]\times \mathbb{R}^d\ni (t,x)\mapsto tx\in \mathbb{R}^d$ where $T\in (0,\infty)$, $d\in \mathbb{N}$, which we both develop within this article.
LGMay 23, 2019
How degenerate is the parametrization of neural networks with the ReLU activation function?Julius Berner, Dennis Elbrächter, Philipp Grohs
Neural network training is usually accomplished by solving a non-convex optimization problem using stochastic gradient descent. Although one optimizes over the networks parameters, the main loss function generally only depends on the realization of the neural network, i.e. the function it computes. Studying the optimization problem over the space of realizations opens up new ways to understand neural network training. In particular, usual loss functions like mean squared error and categorical cross entropy are convex on spaces of neural network realizations, which themselves are non-convex. Approximation capabilities of neural networks can be used to deal with the latter non-convexity, which allows us to establish that for sufficiently large networks local minima of a regularized optimization problem on the realization space are almost optimal. Note, however, that each realization has many different, possibly degenerate, parametrizations. In particular, a local minimum in the parametrization space needs not correspond to a local minimum in the realization space. To establish such a connection, inverse stability of the realization map is required, meaning that proximity of realizations must imply proximity of corresponding parametrizations. We present pathologies which prevent inverse stability in general, and, for shallow networks, proceed to establish a restricted space of parametrizations on which we have inverse stability w.r.t. to a Sobolev norm. Furthermore, we show that by optimizing over such restricted sets, it is still possible to learn any function which can be learned by optimization over unrestricted sets.
LGMay 13, 2019
Towards a regularity theory for ReLU networks -- chain rule and global error estimatesJulius Berner, Dennis Elbrächter, Philipp Grohs et al.
Although for neural networks with locally Lipschitz continuous activation functions the classical derivative exists almost everywhere, the standard chain rule is in general not applicable. We will consider a way of introducing a derivative for neural networks that admits a chain rule, which is both rigorous and easy to work with. In addition we will present a method of converting approximation results on bounded domains to global (pointwise) estimates. This can be used to extend known neural network approximation theory to include the study of regularity properties. Of particular interest is the application to neural networks with ReLU activation function, where it contributes to the understanding of the success of deep learning methods for high-dimensional partial differential equations.
LGJan 17, 2019
The Oracle of DLphiDominik Alfke, Weston Baines, Jan Blechschmidt et al.
We present a novel technique based on deep learning and set theory which yields exceptional classification and prediction results. Having access to a sufficiently large amount of labelled training data, our methodology is capable of predicting the labels of the test data almost always even if the training data is entirely unrelated to the test data. In other words, we prove in a specific setting that as long as one has access to enough data points, the quality of the data is irrelevant.
LGJan 8, 2019
Deep Neural Network Approximation TheoryDennis Elbrächter, Dmytro Perekrestenko, Philipp Grohs et al.
This paper develops fundamental limits of deep neural network learning by characterizing what is possible if no constraints are imposed on the learning algorithm and on the amount of training data. Concretely, we consider Kolmogorov-optimal approximation through deep neural networks with the guiding theme being a relation between the complexity of the function (class) to be approximated and the complexity of the approximating network in terms of connectivity and memory requirements for storing the network topology and the associated quantized weights. The theory we develop establishes that deep networks are Kolmogorov-optimal approximants for markedly different function classes, such as unit balls in Besov spaces and modulation spaces. In addition, deep networks provide exponential approximation accuracy - i.e., the approximation error decays exponentially in the number of nonzero weights in the network - of the multiplication operation, polynomials, sinusoidal functions, and certain smooth functions. Moreover, this holds true even for one-dimensional oscillatory textures and the Weierstrass function - a fractal function, neither of which has previously known methods achieving exponential approximation accuracy. We also show that in the approximation of sufficiently smooth functions finite-width deep networks require strictly smaller connectivity than finite-depth wide networks.
LGSep 9, 2018
Analysis of the Generalization Error: Empirical Risk Minimization over Deep Artificial Neural Networks Overcomes the Curse of Dimensionality in the Numerical Approximation of Black-Scholes Partial Differential EquationsJulius Berner, Philipp Grohs, Arnulf Jentzen
The development of new classification and regression algorithms based on empirical risk minimization (ERM) over deep neural network hypothesis classes, coined deep learning, revolutionized the area of artificial intelligence, machine learning, and data analysis. In particular, these methods have been applied to the numerical solution of high-dimensional partial differential equations with great success. Recent simulations indicate that deep learning-based algorithms are capable of overcoming the curse of dimensionality for the numerical solution of Kolmogorov equations, which are widely used in models from engineering, finance, and the natural sciences. The present paper considers under which conditions ERM over a deep neural network hypothesis class approximates the solution of a $d$-dimensional Kolmogorov equation with affine drift and diffusion coefficients and typical initial values arising from problems in computational finance up to error $\varepsilon$. We establish that, with high probability over draws of training samples, such an approximation can be achieved with both the size of the hypothesis class and the number of training samples scaling only polynomially in $d$ and $\varepsilon^{-1}$. It can be concluded that ERM over deep neural network hypothesis classes overcomes the curse of dimensionality for the numerical solution of linear Kolmogorov equations with affine coefficients.
NASep 7, 2018
A proof that artificial neural networks overcome the curse of dimensionality in the numerical approximation of Black-Scholes partial differential equationsPhilipp Grohs, Fabian Hornung, Arnulf Jentzen et al.
Artificial neural networks (ANNs) have very successfully been used in numerical simulations for a series of computational problems ranging from image classification/image recognition, speech recognition, time series analysis, game intelligence, and computational advertising to numerical approximations of partial differential equations (PDEs). Such numerical simulations suggest that ANNs have the capacity to very efficiently approximate high-dimensional functions and, especially, indicate that ANNs seem to admit the fundamental power to overcome the curse of dimensionality when approximating the high-dimensional functions appearing in the above named computational problems. There are a series of rigorous mathematical approximation results for ANNs in the scientific literature. Some of them prove convergence without convergence rates and some even rigorously establish convergence rates but there are only a few special cases where mathematical results can rigorously explain the empirical success of ANNs when approximating high-dimensional functions. The key contribution of this article is to disclose that ANNs can efficiently approximate high-dimensional functions in the case of numerical approximations of Black-Scholes PDEs. More precisely, this work reveals that the number of required parameters of an ANN to approximate the solution of the Black-Scholes PDE grows at most polynomially in both the reciprocal of the prescribed approximation accuracy $\varepsilon > 0$ and the PDE dimension $d \in \mathbb{N}$. We thereby prove, for the first time, that ANNs do indeed overcome the curse of dimensionality in the numerical approximation of Black-Scholes PDEs.
LGJun 5, 2018
The universal approximation power of finite-width deep ReLU networksDmytro Perekrestenko, Philipp Grohs, Dennis Elbrächter et al.
We show that finite-width deep ReLU neural networks yield rate-distortion optimal approximation (Bölcskei et al., 2018) of polynomials, windowed sinusoidal functions, one-dimensional oscillatory textures, and the Weierstrass function, a fractal function which is continuous but nowhere differentiable. Together with their recently established universal approximation property of affine function systems (Bölcskei et al., 2018), this shows that deep neural networks approximate vastly different signal structures generated by the affine group, the Weyl-Heisenberg group, or through warping, and even certain fractals, all with approximation error decaying exponentially in the number of neurons. We also prove that in the approximation of sufficiently smooth functions finite-width deep networks require strictly smaller connectivity than finite-depth wide networks.
NAJun 1, 2018
Solving the Kolmogorov PDE by means of deep learningChristian Beck, Sebastian Becker, Philipp Grohs et al.
Stochastic differential equations (SDEs) and the Kolmogorov partial differential equations (PDEs) associated to them have been widely used in models from engineering, finance, and the natural sciences. In particular, SDEs and Kolmogorov PDEs, respectively, are highly employed in models for the approximative pricing of financial derivatives. Kolmogorov PDEs and SDEs, respectively, can typically not be solved explicitly and it has been and still is an active topic of research to design and analyze numerical methods which are able to approximately solve Kolmogorov PDEs and SDEs, respectively. Nearly all approximation methods for Kolmogorov PDEs in the literature suffer under the curse of dimensionality or only provide approximations of the solution of the PDE at a single fixed space-time point. In this paper we derive and propose a numerical approximation method which aims to overcome both of the above mentioned drawbacks and intends to deliver a numerical approximation of the Kolmogorov PDE on an entire region $[a,b]^d$ without suffering from the curse of dimensionality. Numerical results on examples including the heat equation, the Black-Scholes model, the stochastic Lorenz equation, and the Heston model suggest that the proposed approximation algorithm is quite effective in high dimensions in terms of both accuracy and speed.
MLJul 10, 2017
Topology Reduction in Deep Convolutional Feature Extraction NetworksThomas Wiatowski, Philipp Grohs, Helmut Bölcskei
Deep convolutional neural networks (CNNs) used in practice employ potentially hundreds of layers and $10$,$000$s of nodes. Such network sizes entail significant computational complexity due to the large number of convolutions that need to be carried out; in addition, a large number of parameters needs to be learned and stored. Very deep and wide CNNs may therefore not be well suited to applications operating under severe resource constraints as is the case, e.g., in low-power embedded and mobile platforms. This paper aims at understanding the impact of CNN topology, specifically depth and width, on the network's feature extraction capabilities. We address this question for the class of scattering networks that employ either Weyl-Heisenberg filters or wavelets, the modulus non-linearity, and no pooling. The exponential feature map energy decay results in Wiatowski et al., 2017, are generalized to $\mathcal{O}(a^{-N})$, where an arbitrary decay factor $a>1$ can be realized through suitable choice of the Weyl-Heisenberg prototype function or the mother wavelet. We then show how networks of fixed (possibly small) depth $N$ can be designed to guarantee that $((1-\varepsilon)\cdot 100)\%$ of the input signal's energy are contained in the feature vector. Based on the notion of operationally significant nodes, we characterize, partly rigorously and partly heuristically, the topology-reducing effects of (effectively) band-limited input signals, band-limited filters, and feature map symmetries. Finally, for networks based on Weyl-Heisenberg filters, we determine the prototype function bandwidth that minimizes---for fixed network depth $N$---the average number of operationally significant nodes per layer.
LGMay 4, 2017
Optimal Approximation with Sparsely Connected Deep Neural NetworksHelmut Bölcskei, Philipp Grohs, Gitta Kutyniok et al.
We derive fundamental lower bounds on the connectivity and the memory requirements of deep neural networks guaranteeing uniform approximation rates for arbitrary function classes in $L^2(\mathbb R^d)$. In other words, we establish a connection between the complexity of a function class and the complexity of deep neural networks approximating functions from this class to within a prescribed accuracy. Additionally, we prove that our lower bounds are achievable for a broad family of function classes. Specifically, all function classes that are optimally approximated by a general class of representation systems---so-called \emph{affine systems}---can be approximated by deep neural networks with minimal connectivity and memory requirements. Affine systems encompass a wealth of representation systems from applied harmonic analysis such as wavelets, ridgelets, curvelets, shearlets, $α$-shearlets, and more generally $α$-molecules. Our central result elucidates a remarkable universality property of neural networks and shows that they achieve the optimum approximation properties of all affine systems combined. As a specific example, we consider the class of $α^{-1}$-cartoon-like functions, which is approximated optimally by $α$-shearlets. We also explain how our results can be extended to the case of functions on low-dimensional immersed manifolds. Finally, we present numerical experiments demonstrating that the standard stochastic gradient descent algorithm generates deep neural networks providing close-to-optimal approximation rates. Moreover, these results indicate that stochastic gradient descent can actually learn approximations that are sparse in the representation systems optimally sparsifying the function class the network is trained on.
ITApr 12, 2017
Energy Propagation in Deep Convolutional Neural NetworksThomas Wiatowski, Philipp Grohs, Helmut Bölcskei
Many practical machine learning tasks employ very deep convolutional neural networks. Such large depths pose formidable computational challenges in training and operating the network. It is therefore important to understand how fast the energy contained in the propagated signals (a.k.a. feature maps) decays across layers. In addition, it is desirable that the feature extractor generated by the network be informative in the sense of the only signal mapping to the all-zeros feature vector being the zero input signal. This "trivial null-set" property can be accomplished by asking for "energy conservation" in the sense of the energy in the feature vector being proportional to that of the corresponding input signal. This paper establishes conditions for energy conservation (and thus for a trivial null-set) for a wide class of deep convolutional neural network-based feature extractors and characterizes corresponding feature map energy decay rates. Specifically, we consider general scattering networks employing the modulus non-linearity and we find that under mild analyticity and high-pass conditions on the filters (which encompass, inter alia, various constructions of Weyl-Heisenberg filters, wavelets, ridgelets, ($α$)-curvelets, and shearlets) the feature map energy decays at least polynomially fast. For broad families of wavelets and Weyl-Heisenberg filters, the guaranteed decay rate is shown to be exponential. Moreover, we provide handy estimates of the number of layers needed to have at least $((1-\varepsilon)\cdot 100)\%$ of the input signal energy be contained in the feature vector.
LGMay 26, 2016
Discrete Deep Feature Extraction: A Theory and New ArchitecturesThomas Wiatowski, Michael Tschannen, Aleksandar Stanić et al.
First steps towards a mathematical theory of deep convolutional neural networks for feature extraction were made---for the continuous-time case---in Mallat, 2012, and Wiatowski and Bölcskei, 2015. This paper considers the discrete case, introduces new convolutional neural network architectures, and proposes a mathematical framework for their analysis. Specifically, we establish deformation and translation sensitivity results of local and global nature, and we investigate how certain structural properties of the input signal are reflected in the corresponding feature vectors. Our theory applies to general filters and general Lipschitz-continuous non-linearities and pooling operators. Experiments on handwritten digit classification and facial landmark detection---including feature importance evaluation---complement the theoretical findings.
LGApr 29, 2016
Deep Convolutional Neural Networks on Cartoon FunctionsPhilipp Grohs, Thomas Wiatowski, Helmut Bölcskei
Wiatowski and Bölcskei, 2015, proved that deformation stability and vertical translation invariance of deep convolutional neural network-based feature extractors are guaranteed by the network structure per se rather than the specific convolution kernels and non-linearities. While the translation invariance result applies to square-integrable functions, the deformation stability bound holds for band-limited functions only. Many signals of practical relevance (such as natural images) exhibit, however, sharp and curved discontinuities and are, hence, not band-limited. The main contribution of this paper is a deformation stability result that takes these structural properties into account. Specifically, we establish deformation stability bounds for the class of cartoon functions introduced by Donoho, 2001.