LGAug 9, 2022
On the Activation Function Dependence of the Spectral Bias of Neural NetworksQingguo Hong, Jonathan W. Siegel, Qinyang Tan et al.
Neural networks are universal function approximators which are known to generalize well despite being dramatically overparameterized. We study this phenomenon from the point of view of the spectral bias of neural networks. Our contributions are two-fold. First, we provide a theoretical explanation for the spectral bias of ReLU neural networks by leveraging connections with the theory of finite element methods. Second, based upon this theory we predict that switching the activation function to a piecewise linear B-spline, namely the Hat function, will remove this spectral bias, which we verify empirically in a variety of settings. Our empirical studies also show that neural networks with the Hat activation function are trained significantly faster using stochastic gradient descent and ADAM. Combined with previous work showing that the Hat activation function also improves generalization accuracy on image classification tasks, this indicates that using the Hat activation provides significant advantages over the ReLU on certain problems.
97.6NAJun 2
Sampling and reconstruction of convex functionsAndrea Bonito, Albert Cohen, Wolfgang Dahmen et al.
We discuss optimal recovery for classes of multivariate convex functions from given point samples, as well as the sampling numbers of these classes, corresponding to optimal sample choices. Upper and lower bounds for either variant are established when the reconstruction error is measured in $L_p$ for $1\leq p\leq \infty$. These bounds match, sometimes up to logarithmic factors, and therefore characterize the respective optimal rate of decay. For classical smoothness classes such as Sobolev, Hölder or Besov spaces, it is well known that the optimal decay rate of sampling numbers can be achieved by sampling on uniform tensor product grids and using linear methods of reconstruction, such as piecewise polynomial interpolation. One of the main findings in this paper is that for classes of convex functions, these procedures generally produce suboptimal rates, except when $p=1$ and $p=\infty$, and are outperformed by nonlinear reconstruction methods that do not employ tensor product grids.
MLJul 28, 2023
Weighted variation spaces and approximation by shallow ReLU networksRonald DeVore, Robert D. Nowak, Rahul Parhi et al.
We investigate the approximation of functions $f$ on a bounded domain $Ω\subset \mathbb{R}^d$ by the outputs of single-hidden-layer ReLU neural networks of width $n$. This form of nonlinear $n$-term dictionary approximation has been intensely studied since it is the simplest case of neural network approximation (NNA). There are several celebrated approximation results for this form of NNA that introduce novel model classes of functions on $Ω$ whose approximation rates do not grow unbounded with the input dimension. These novel classes include Barron classes, and classes based on sparsity or variation such as the Radon-domain BV classes. The present paper is concerned with the definition of these novel model classes on domains $Ω$. The current definition of these model classes does not depend on the domain $Ω$. A new and more proper definition of model classes on domains is given by introducing the concept of weighted variation spaces. These new model classes are intrinsic to the domain itself. The importance of these new model classes is that they are strictly larger than the classical (domain-independent) classes. Yet, it is shown that they maintain the same NNA rates.
MLNov 25, 2022
Optimal Approximation Rates for Deep ReLU Neural Networks on Sobolev and Besov SpacesJonathan W. Siegel
Let $Ω= [0,1]^d$ be the unit cube in $\mathbb{R}^d$. We study the problem of how efficiently, in terms of the number of parameters, deep neural networks with the ReLU activation function can approximate functions in the Sobolev spaces $W^s(L_q(Ω))$ and Besov spaces $B^s_r(L_q(Ω))$, with error measured in the $L_p(Ω)$ norm. This problem is important when studying the application of neural networks in a variety of fields, including scientific computing and signal processing, and has previously been solved only when $p=q=\infty$. Our contribution is to provide a complete solution for all $1\leq p,q\leq \infty$ and $s > 0$ for which the corresponding Sobolev or Besov space compactly embeds into $L_p$. The key technical tool is a novel bit-extraction technique which gives an optimal encoding of sparse vectors. This enables us to obtain sharp upper bounds in the non-linear regime where $p > q$. We also provide a novel method for deriving $L_p$-approximation lower bounds based upon VC-dimension when $p < \infty$. Our results show that very deep ReLU networks significantly outperform classical methods of approximation in terms of the number of parameters, but that this comes at the cost of parameters which are not encodable.
MLJul 28, 2023
Optimal Approximation of Zonoids and Uniform Approximation by Shallow Neural NetworksJonathan W. Siegel
We study the following two related problems. The first is to determine to what error an arbitrary zonoid in $\mathbb{R}^{d+1}$ can be approximated in the Hausdorff distance by a sum of $n$ line segments. The second is to determine optimal approximation rates in the uniform norm for shallow ReLU$^k$ neural networks on their variation spaces. The first of these problems has been solved for $d\neq 2,3$, but when $d=2,3$ a logarithmic gap between the best upper and lower bounds remains. We close this gap, which completes the solution in all dimensions. For the second problem, our techniques significantly improve upon existing approximation rates when $k\geq 1$, and enable uniform approximation of both the target function and its derivatives.
MLAug 20, 2024
Approximation Rates for Shallow ReLU$^k$ Neural Networks on Sobolev Spaces via the Radon TransformTong Mao, Jonathan W. Siegel, Jinchao Xu
Let $Ω\subset \mathbb{R}^d$ be a bounded domain. We consider the problem of how efficiently shallow neural networks with the ReLU$^k$ activation function can approximate functions from Sobolev spaces $W^s(L_p(Ω))$ with error measured in the $L_q(Ω)$-norm. Utilizing the Radon transform and recent results from discrepancy theory, we provide a simple proof of nearly optimal approximation rates in a variety of cases, including when $q\leq p$, $p\geq 2$, and $s \leq k + (d+1)/2$. The rates we derive are optimal up to logarithmic factors, and significantly generalize existing results. An interesting consequence is that the adaptivity of shallow ReLU$^k$ neural networks enables them to obtain optimal approximation rates for smoothness up to order $s = k + (d+1)/2$, even though they represent piecewise polynomials of fixed degree $k$.
LGDec 18, 2025
In-Context Multi-Operator Learning with DeepOSetsShao-Ting Chiu, Aditya Nambiar, Ali Syed et al.
An important application of neural networks to scientific computing has been the learning of non-linear operators. In this framework, a neural network is trained to fit a non-linear map between two infinite dimensional spaces, for example, the solution operator of ordinary and partial differential equations. Recently, inspired by the discovery of in-context learning for large language models, an even more ambitious paradigm has been explored, called multi-operator learning. In this approach, a neural network is trained to learn many different operators at the same time. In order to evaluate one of the learned operators, the network is passed example inputs and outputs to disambiguate the desired operator. In this work, we provide a precise mathematical formulation of the multi-operator learning problem. In addition, we modify a simple efficient architecture, called DeepOSets, for multi-operator learning and prove its universality for multi-operator learning. Finally, we provide a comprehensive set of experiments that demonstrate the ability of DeepOSets to learn multiple operators corresponding to different initial-value and boundary-value differential equations and use in-context examples to predict accurately the solutions corresponding to queries and differential equations not seen during training. The main advantage of DeepOSets is its architectural simplicity, which allows the derivation of theoretical guarantees and training times that are in the order of minutes, in contrast to similar transformer-based alternatives that are empirically justified and require hours of training.
MLFeb 10, 2023
Nesterov acceleration despite very noisy gradientsKanan Gupta, Jonathan W. Siegel, Stephan Wojtowytsch
We present a generalization of Nesterov's accelerated gradient descent algorithm. Our algorithm (AGNES) provably achieves acceleration for smooth convex and strongly convex minimization tasks with noisy gradient estimates if the noise intensity is proportional to the magnitude of the gradient at every point. Nesterov's method converges at an accelerated rate if the constant of proportionality is below 1, while AGNES accommodates any signal-to-noise ratio. The noise model is motivated by applications in overparametrized machine learning. AGNES requires only two parameters in convex and three in strongly convex minimization tasks, improving on existing methods. We further provide clear geometric interpretations and heuristics for the choice of parameters.
MLJul 15, 2023
Sharp Convergence Rates for Matching PursuitJason M. Klusowski, Jonathan W. Siegel
We study the fundamental limits of matching pursuit, or the pure greedy algorithm, for approximating a target function $ f $ by a linear combination $f_n$ of $n$ elements from a dictionary. When the target function is contained in the variation space corresponding to the dictionary, many impressive works over the past few decades have obtained upper and lower bounds on the error $\|f-f_n\|$ of matching pursuit, but they do not match. The main contribution of this paper is to close this gap and obtain a sharp characterization of the decay rate, $n^{-α}$, of matching pursuit. Specifically, we construct a worst case dictionary which shows that the existing best upper bound cannot be significantly improved. It turns out that, unlike other greedy algorithm variants which converge at the optimal rate $ n^{-1/2}$, the convergence rate $n^{-α}$ is suboptimal. Here, $α\approx 0.182$ is determined by the solution to a certain non-linear equation.
LGFeb 2, 2023
Sharp Lower Bounds on Interpolation by Deep ReLU Neural Networks at Irregularly Spaced DataJonathan W. Siegel
We study the interpolation power of deep ReLU neural networks. Specifically, we consider the question of how efficiently, in terms of the number of parameters, deep ReLU networks can interpolate values at $N$ datapoints in the unit ball which are separated by a distance $δ$. We show that $Ω(N)$ parameters are required in the regime where $δ$ is exponentially small in $N$, which gives the sharp result in this regime since $O(N)$ parameters are always sufficient. This also shows that the bit-extraction technique used to prove lower bounds on the VC dimension cannot be applied to irregularly spaced datapoints. Finally, as an application we give a lower bound on the approximation rates that deep ReLU neural networks can achieve for Sobolev spaces at the embedding endpoint.
LGFeb 23
Quantitative Approximation Rates for Group Equivariant LearningJonathan W. Siegel, Snir Hordan, Hannah Lawrence et al.
The universal approximation theorem establishes that neural networks can approximate any continuous function on a compact set. Later works in approximation theory provide quantitative approximation rates for ReLU networks on the class of $α$-Hölder functions $f: [0,1]^N \to \mathbb{R}$. The goal of this paper is to provide similar quantitative approximation results in the context of group equivariant learning, where the learned $α$-Hölder function is known to obey certain group symmetries. While there has been much interest in the literature in understanding the universal approximation properties of equivariant models, very few quantitative approximation results are known for equivariant models. In this paper, we bridge this gap by deriving quantitative approximation rates for several prominent group-equivariant and invariant architectures. The architectures that we consider include: the permutation-invariant Deep Sets architecture; the permutation-equivariant Sumformer and Transformer architectures; joint invariance to permutations and rigid motions using invariant networks based on frame averaging; and general bi-Lipschitz invariant models. Overall, we show that equally-sized ReLU MLPs and equivariant architectures are equally expressive over equivariant functions. Thus, hard-coding equivariance does not result in a loss of expressivity or approximation power in these models.
OCOct 26, 2023
A qualitative difference between gradient flows of convex functions in finite- and infinite-dimensional Hilbert spacesJonathan W. Siegel, Stephan Wojtowytsch
We consider gradient flow/gradient descent and heavy ball/accelerated gradient descent optimization for convex objective functions. In the gradient flow case, we prove the following: 1. If $f$ does not have a minimizer, the convergence $f(x_t)\to \inf f$ can be arbitrarily slow. 2. If $f$ does have a minimizer, the excess energy $f(x_t) - \inf f$ is integrable/summable in time. In particular, $f(x_t) - \inf f = o(1/t)$ as $t\to\infty$. 3. In Hilbert spaces, this is optimal: $f(x_t) - \inf f$ can decay to $0$ as slowly as any given function which is monotone decreasing and integrable at $\infty$, even for a fixed quadratic objective. 4. In finite dimension (or more generally, for all gradient flow curves of finite length), this is not optimal: We prove that there are convex monotone decreasing integrable functions $g(t)$ which decrease to zero slower than $f(x_t)-\inf f$ for the gradient flow of any convex function on $\mathbb R^d$. For instance, we show that any gradient flow $x_t$ of a convex function $f$ in finite dimension satisfies $\liminf_{t\to\infty} \big(t\cdot \log^2(t)\cdot \big\{f(x_t) -\inf f\big\}\big)=0$. This improves on the commonly reported $O(1/t)$ rate and provides a sharp characterization of the energy decay law. We also note that it is impossible to establish a rate $O(1/(tφ(t))$ for any function $φ$ which satisfies $\lim_{t\to\infty}φ(t) = \infty$, even asymptotically. Similar results are obtained in related settings for (1) discrete time gradient descent, (2) stochastic gradient descent with multiplicative noise and (3) the heavy ball ODE. In the case of stochastic gradient descent, the summability of $\mathbb E[f(x_n) - \inf f]$ is used to prove that $f(x_n)\to \inf f$ almost surely - an improvement on the convergence almost surely up to a subsequence which follows from the $O(1/n)$ decay estimate.
22.5LGMay 8
Embedding Dimension Lower Bounds for Universality of Deep Sets and Janossy PoolingAli Syed, Aditya Nambiar, Jonathan W. Siegel
In many practical applications it is important to build symmetries into neural network architectures. Consider the important case of permutation symmetry on point clouds consisting of $n$ points in $d$ dimensions. In this case the network learns a function on a set of $n$ points in $\mathbb{R}^d$, and a natural paradigm for constructing invariant networks is Janossy pooling, which generalizes the popular Deep Sets architecture. We study the universality of this approach, in particular the important question of how large the embedding dimension must be to guarantee universality of this architecture. Specifically, using a novel technique, we prove new lower bounds on the required size of this embedding dimension. For Deep Sets, this gives the correct minimal dimension up to a constant factor for all $d > 1$. For $k$-ary Janossy pooling, we prove the first non-trivial lower bound on the required embedding dimension when $k > 1$.
LGFeb 25, 2024
Equivariant Frames and the Impossibility of Continuous CanonicalizationNadav Dym, Hannah Lawrence, Jonathan W. Siegel
Canonicalization provides an architecture-agnostic method for enforcing equivariance, with generalizations such as frame-averaging recently gaining prominence as a lightweight and flexible alternative to equivariant architectures. Recent works have found an empirical benefit to using probabilistic frames instead, which learn weighted distributions over group elements. In this work, we provide strong theoretical justification for this phenomenon: for commonly-used groups, there is no efficiently computable choice of frame that preserves continuity of the function being averaged. In other words, unweighted frame-averaging can turn a smooth, non-symmetric function into a discontinuous, symmetric function. To address this fundamental robustness problem, we formally define and construct \emph{weighted} frames, which provably preserve continuity, and demonstrate their utility by constructing efficient and continuous weighted frames for the actions of $SO(2)$, $SO(3)$, and $S_n$ on point clouds.
MLJun 28, 2021
Characterization of the Variation Spaces Corresponding to Shallow Neural NetworksJonathan W. Siegel, Jinchao Xu
We study the variation space corresponding to a dictionary of functions in $L^2(Ω)$ for a bounded domain $Ω\subset \mathbb{R}^d$. Specifically, we compare the variation space, which is defined in terms of a convex hull with related notions based on integral representations. This allows us to show that three important notions relating to the approximation theory of shallow neural networks, the Barron space, the spectral Barron space, and the Radon BV space, are actually variation spaces with respect to certain natural dictionaries.
MLJun 28, 2021
Sharp Lower Bounds on the Approximation Rate of Shallow Neural NetworksJonathan W. Siegel, Jinchao Xu
We consider the approximation rates of shallow neural networks with respect to the variation norm. Upper bounds on these rates have been established for sigmoidal and ReLU activation functions, but it has remained an important open problem whether these rates are sharp. In this article, we provide a solution to this problem by proving sharp lower bounds on the approximation rates for shallow neural networks, which are obtained by lower bounding the $L^2$-metric entropy of the convex hull of the neural network basis functions. In addition, our methods also give sharp lower bounds on the Kolmogorov $n$-widths of this convex hull, which show that the variation spaces corresponding to shallow neural networks cannot be efficiently approximated by linear methods. These lower bounds apply to both sigmoidal activation functions with bounded variation and to activation functions which are a power of the ReLU. Our results also quantify how much stronger the Barron spectral norm is than the variation norm and, combined with previous results, give the asymptotics of the $L^\infty$-metric entropy up to logarithmic factors in the case of the ReLU activation function.
MLJan 29, 2021
Sharp Bounds on the Approximation Rates, Metric Entropy, and $n$-widths of Shallow Neural NetworksJonathan W. Siegel, Jinchao Xu
In this article, we study approximation properties of the variation spaces corresponding to shallow neural networks with a variety of activation functions. We introduce two main tools for estimating the metric entropy, approximation rates, and $n$-widths of these spaces. First, we introduce the notion of a smoothly parameterized dictionary and give upper bounds on the non-linear approximation rates, metric entropy and $n$-widths of their absolute convex hull. The upper bounds depend upon the order of smoothness of the parameterization. This result is applied to dictionaries of ridge functions corresponding to shallow neural networks, and they improve upon existing results in many cases. Next, we provide a method for lower bounding the metric entropy and $n$-widths of variation spaces which contain certain classes of ridge functions. This result gives sharp lower bounds on the $L^2$-approximation rates, metric entropy, and $n$-widths for variation spaces corresponding to neural networks with a range of important activation functions, including ReLU$^k$ activation functions and sigmoidal activation functions with bounded variation.
CVAug 21, 2020
Training Sparse Neural Networks using Compressed SensingJonathan W. Siegel, Jianhong Chen, Pengchuan Zhang et al.
Pruning the weights of neural networks is an effective and widely-used technique for reducing model size and inference complexity. We develop and test a novel method based on compressed sensing which combines the pruning and training into a single step. Specifically, we utilize an adaptively weighted $\ell^1$ penalty on the weights during training, which we combine with a generalization of the regularized dual averaging (RDA) algorithm in order to train sparse neural networks. The adaptive weighting we introduce corresponds to a novel regularizer based on the logarithm of the absolute value of the weights. We perform a series of ablation studies demonstrating the improvement provided by the adaptive weighting and generalized RDA algorithm. Furthermore, numerical experiments on the CIFAR-10, CIFAR-100, and ImageNet datasets demonstrate that our method 1) trains sparser, more accurate networks than existing state-of-the-art methods; 2) can be used to train sparse networks from scratch, i.e. from a random initialization, as opposed to initializing with a well-trained base model; 3) acts as an effective regularizer, improving generalization accuracy.
CAApr 4, 2019
Approximation Rates for Neural Networks with General Activation FunctionsJonathan W. Siegel, Jinchao Xu
We prove some new results concerning the approximation rate of neural networks with general activation functions. Our first result concerns the rate of approximation of a two layer neural network with a polynomially-decaying non-sigmoidal activation function. We extend the dimension independent approximation rates previously obtained to this new class of activation functions. Our second result gives a weaker, but still dimension independent, approximation rate for a larger class of activation functions, removing the polynomial decay assumption. This result applies to any bounded, integrable activation function. Finally, we show that a stratified sampling approach can be used to improve the approximation rate for polynomially decaying activation functions under mild additional assumptions.