27.0LGJun 3
A prism hierarchy of learning regimes in large linear autoencodersEugene Golikov, Yaroslav Gusev, Dmitry Yarotsky
Theoretical studies of machine learning models commonly consider different limiting regimes in which the learning dynamics of gradient descent becomes theoretically tractable. It is, however, desirable to have a systematically obtained picture of all qualitatively different extreme learning regimes for a particular type of models. In this paper we propose such a picture for large weight-tied linear autoencoders characterized by input and latent dimensions, initialization magnitude, and training set size. This model is nonlinear in the weights and its gradient flow does not have a general theoretical solution. We show that at the level of the formal loss-expansion hierarchy, its extreme regimes are naturally associated with faces of a triangular prism. In particular, there are five basic extreme regimes associated with the 2-faces of the prism: (1) large-data, (2) small-data, (3) mean-field, (4) narrow-latent, and (5) free. For regimes (1,2,3,4), we derive explicit expressions for both train and population limiting loss evolutions under gradient flow, obtaining very good agreement with experimental results.
LGJun 22, 2022
A view of mini-batch SGD via generating functions: conditions of convergence, phase transitions, benefit from negative momentaMaksim Velikanov, Denis Kuznedelev, Dmitry Yarotsky
Mini-batch SGD with momentum is a fundamental algorithm for learning large predictive models. In this paper we develop a new analytic framework to analyze noise-averaged properties of mini-batch SGD for linear models at constant learning rates, momenta and sizes of batches. Our key idea is to consider the dynamics of the second moments of model parameters for a special family of "Spectrally Expressible" approximations. This allows to obtain an explicit expression for the generating function of the sequence of loss values. By analyzing this generating function, we find, in particular, that 1) the SGD dynamics exhibits several convergent and divergent regimes depending on the spectral distributions of the problem; 2) the convergent regimes admit explicit stability conditions, and explicit loss asymptotics in the case of power-law spectral distributions; 3) the optimal convergence rate can be achieved at negative momenta. We verify our theoretical predictions by extensive experiments with MNIST, CIFAR10 and synthetic problems, and find a good quantitative agreement.
LGNov 7, 2023
Structure of universal formulasDmitry Yarotsky
By universal formulas we understand parameterized analytic expressions that have a fixed complexity, but nevertheless can approximate any continuous function on a compact set. There exist various examples of such formulas, including some in the form of neural networks. In this paper we analyze the essential structural elements of these highly expressive models. We introduce a hierarchy of expressiveness classes connecting the global approximability property to the weaker property of infinite VC dimension, and prove a series of classification results for several increasingly complex functional families. In particular, we introduce a general family of polynomially-exponentially-algebraic functions that, as we prove, is subject to polynomial constraints. As a consequence, we show that fixed-size neural networks with not more than one layer of neurons having transcendental activations (e.g., sine or standard sigmoid) cannot in general approximate functions on arbitrary finite sets. On the other hand, we give examples of functional families, including two-hidden-layer neural networks, that approximate functions on arbitrary finite sets, but fail to do that on the whole domain of definition.
OCApr 16, 2025
Corner Gradient DescentDmitry Yarotsky
We consider SGD-type optimization on infinite-dimensional quadratic problems with power law spectral conditions. It is well-known that on such problems deterministic GD has loss convergence rates $L_t=O(t^{-ζ})$, which can be improved to $L_t=O(t^{-2ζ})$ by using Heavy Ball with a non-stationary Jacobi-based schedule (and the latter rate is optimal among fixed schedules). However, in the mini-batch Stochastic GD setting, the sampling noise causes the Jacobi HB to diverge; accordingly no $O(t^{-2ζ})$ algorithm is known. In this paper we show that rates up to $O(t^{-2ζ})$ can be achieved by a generalized stationary SGD with infinite memory. We start by identifying generalized (S)GD algorithms with contours in the complex plane. We then show that contours that have a corner with external angle $θπ$ accelerate the plain GD rate $O(t^{-ζ})$ to $O(t^{-θζ})$. For deterministic GD, increasing $θ$ allows to achieve rates arbitrarily close to $O(t^{-2ζ})$. However, in Stochastic GD, increasing $θ$ also amplifies the sampling noise, so in general $θ$ needs to be optimized by balancing the acceleration and noise effects. We prove that the optimal rate is given by $θ_{\max}=\min(2,ν,\tfrac{2}{ζ+1/ν})$, where $ν,ζ$ are the exponents appearing in the capacity and source spectral conditions. Furthermore, using fast rational approximations of the power functions, we show that ideal corner algorithms can be efficiently approximated by finite-memory algorithms, and demonstrate their practical efficiency on a synthetic problem and MNIST.
LGMar 18, 2024
Generalization error of spectral algorithmsMaksim Velikanov, Maxim Panov, Dmitry Yarotsky
The asymptotically precise estimation of the generalization of kernel methods has recently received attention due to the parallels between neural networks and their associated kernels. However, prior works derive such estimates for training by kernel ridge regression (KRR), whereas neural networks are typically trained with gradient descent (GD). In the present work, we consider the training of kernels with a family of $\textit{spectral algorithms}$ specified by profile $h(λ)$, and including KRR and GD as special cases. Then, we derive the generalization error as a functional of learning profile $h(λ)$ for two data models: high-dimensional Gaussian and low-dimensional translation-invariant model. Under power-law assumptions on the spectrum of the kernel and target, we use our framework to (i) give full loss asymptotics for both noisy and noiseless observations (ii) show that the loss localizes on certain spectral scales, giving a new perspective on the KRR saturation phenomenon (iii) conjecture, and demonstrate for the considered data models, the universality of the loss w.r.t. non-spectral details of the problem, but only in case of noisy observation.
MLFeb 26, 2024
Learnability of high-dimensional targets by two-parameter models and gradient flowDmitry Yarotsky
We explore the theoretical possibility of learning $d$-dimensional targets with $W$-parameter models by gradient flow (GF) when $W<d$. Our main result shows that if the targets are described by a particular $d$-dimensional probability distribution, then there exist models with as few as two parameters that can learn the targets with arbitrarily high success probability. On the other hand, we show that for $W<d$ there is necessarily a large subset of GF-non-learnable targets. In particular, the set of learnable targets is not dense in $\mathbb R^d$, and any subset of $\mathbb R^d$ homeomorphic to the $W$-dimensional sphere contains non-learnable targets. Finally, we observe that the model in our main theorem on almost guaranteed two-parameter learning is constructed using a hierarchical procedure and as a result is not expressible by a single elementary function. We show that this limitation is essential in the sense that most models written in terms of elementary functions cannot achieve the learnability demonstrated in this theorem.
LGFeb 4
Gradient Flow Through Diagram Expansions: Learning Regimes and Explicit SolutionsDmitry Yarotsky, Eugene Golikov, Yaroslav Gusev
We develop a general mathematical framework to analyze scaling regimes and derive explicit analytic solutions for gradient flow (GF) in large learning problems. Our key innovation is a formal power series expansion of the loss evolution, with coefficients encoded by diagrams akin to Feynman diagrams. We show that this expansion has a well-defined large-size limit that can be used to reveal different learning phases and, in some cases, to obtain explicit solutions of the nonlinear GF. We focus on learning Canonical Polyadic (CP) decompositions of high-order tensors, and show that this model has several distinct extreme lazy and rich GF regimes such as free evolution, NTK and under- and over-parameterized mean-field. We show that these regimes depend on the parameter scaling, tensor order, and symmetry of the model in a specific and subtle way. Moreover, we propose a general approach to summing the formal loss expansion by reducing it to a PDE; in a wide range of scenarios, it turns out to be 1st order and solvable by the method of characteristics. We observe a very good agreement of our theoretical predictions with experiment.
MLFeb 24, 2022
Embedded Ensembles: Infinite Width Limit and Operating RegimesMaksim Velikanov, Roman Kail, Ivan Anokhin et al.
A memory efficient approach to ensembling neural networks is to share most weights among the ensembled models by means of a single reference network. We refer to this strategy as Embedded Ensembling (EE); its particular examples are BatchEnsembles and Monte-Carlo dropout ensembles. In this paper we perform a systematic theoretical and empirical analysis of embedded ensembles with different number of models. Theoretically, we use a Neural-Tangent-Kernel-based approach to derive the wide network limit of the gradient descent dynamics. In this limit, we identify two ensemble regimes - independent and collective - depending on the architecture and initialization strategy of ensemble models. We prove that in the independent regime the embedded ensemble behaves as an ensemble of independent models. We confirm our theoretical prediction with a wide range of experiments with finite networks, and further study empirically various effects such as transition between the two regimes, scaling of ensemble performance with the network width and number of models, and dependence of performance on a number of architecture and hyperparameter choices.
OCFeb 2, 2022
Tight Convergence Rate Bounds for Optimization Under Power Law Spectral ConditionsMaksim Velikanov, Dmitry Yarotsky
Performance of optimization on quadratic problems sensitively depends on the low-lying part of the spectrum. For large (effectively infinite-dimensional) problems, this part of the spectrum can often be naturally represented or approximated by power law distributions, resulting in power law convergence rates for iterative solutions of these problems by gradient-based algorithms. In this paper, we propose a new spectral condition providing tighter upper bounds for problems with power law optimization trajectories. We use this condition to build a complete picture of upper and lower bounds for a wide range of optimization algorithms -- Gradient Descent, Steepest Descent, Heavy Ball, and Conjugate Gradients -- with an emphasis on the underlying schedules of learning rate and momentum. In particular, we demonstrate how an optimally accelerated method, its schedule, and convergence upper bound can be obtained in a unified manner for a given shape of the spectrum. Also, we provide first proofs of tight lower bounds for convergence rates of Steepest Descent and Conjugate Gradients under spectral power laws with general exponents. Our experiments show that the obtained convergence bounds and acceleration strategies are not only relevant for exactly quadratic optimization problems, but also fairly accurate when applied to the training of neural networks.
LGMay 2, 2021
Universal scaling laws in the gradient descent training of neural networksMaksim Velikanov, Dmitry Yarotsky
Current theoretical results on optimization trajectories of neural networks trained by gradient descent typically have the form of rigorous but potentially loose bounds on the loss values. In the present work we take a different approach and show that the learning trajectory can be characterized by an explicit asymptotic at large training times. Specifically, the leading term in the asymptotic expansion of the loss behaves as a power law $L(t) \sim t^{-ξ}$ with exponent $ξ$ expressed only through the data dimension, the smoothness of the activation function, and the class of function being approximated. Our results are based on spectral analysis of the integral operator representing the linearized evolution of a large network trained on the expected loss. Importantly, the techniques we employ do not require specific form of a data distribution, for example Gaussian, thus making our findings sufficiently universal.
NEFeb 22, 2021
Elementary superexpressive activationsDmitry Yarotsky
We call a finite family of activation functions superexpressive if any multivariate continuous function can be approximated by a neural network that uses these activations and has a fixed architecture only depending on the number of input variables (i.e., to achieve any accuracy we only need to adjust the weights, without increasing the number of neurons). Previously, it was known that superexpressive activations exist, but their form was quite complex. We give examples of very simple superexpressive families: for example, we prove that the family {sin, arcsin} is superexpressive. We also show that most practical activations (not involving periodic functions) are not superexpressive.
LGAug 3, 2020
Low-loss connection of weight vectors: distribution-based approachesIvan Anokhin, Dmitry Yarotsky
Recent research shows that sublevel sets of the loss surfaces of overparameterized networks are connected, exactly or approximately. We describe and compare experimentally a panel of methods used to connect two low-loss points by a low-loss curve on this surface. Our methods vary in accuracy and complexity. Most of our methods are based on "macroscopic" distributional assumptions, and some are insensitive to the detailed properties of the points to be connected. Some methods require a prior training of a "global connection model" which can then be applied to any pair of points. The accuracy of the method generally correlates with its complexity and sensitivity to the endpoint detail.
NEJun 22, 2019
The phase diagram of approximation rates for deep neural networksDmitry Yarotsky, Anton Zhevnerchuk
We explore the phase diagram of approximation rates for deep neural networks and prove several new theoretical results. In particular, we generalize the existing result on the existence of deep discontinuous phase in ReLU networks to functional classes of arbitrary positive smoothness, and identify the boundary between the feasible and infeasible rates. Moreover, we show that all networks with a piecewise polynomial activation function have the same phase diagram. Next, we demonstrate that standard fully-connected architectures with a fixed width independent of smoothness can adapt to smoothness and achieve almost optimal rates. Finally, we consider deep networks with periodic activations ("deep Fourier expansion") and prove that they have very fast, nearly exponential approximation rates, thanks to the emerging capability of the network to implement efficient lookup operations.
NEOct 9, 2018
Collective evolution of weights in wide neural networksDmitry Yarotsky
We derive a nonlinear integro-differential transport equation describing collective evolution of weights under gradient descent in large-width neural-network-like models. We characterize stationary points of the evolution and analyze several scenarios where the transport equation can be solved approximately. We test our general method in the special case of linear free-knot splines, and find good agreement between theory and experiment in observations of global optima, stability of stationary points, and convergence rates.
NEApr 26, 2018
Universal approximations of invariant maps by neural networksDmitry Yarotsky
We describe generalizations of the universal approximation theorem for neural networks to maps invariant or equivariant with respect to linear representations of groups. Our goal is to establish network-like computational models that are both invariant/equivariant and provably complete in the sense of their ability to approximate any continuous invariant/equivariant map. Our contribution is three-fold. First, in the general case of compact groups we propose a construction of a complete invariant/equivariant network using an intermediate polynomial layer. We invoke classical theorems of Hilbert and Weyl to justify and simplify this construction; in particular, we describe an explicit complete ansatz for approximation of permutation-invariant maps. Second, we consider groups of translations and prove several versions of the universal approximation theorem for convolutional networks in the limit of continuous signals on euclidean spaces. Finally, we consider 2D signal transformations equivariant with respect to the group SE(2) of rigid euclidean motions. In this case we introduce the "charge--conserving convnet" -- a convnet-like computational model based on the decomposition of the feature space into isotypic representations of SO(2). We prove this model to be a universal approximator for continuous SE(2)--equivariant signal transformations.
NEFeb 10, 2018
Optimal approximation of continuous functions by very deep ReLU networksDmitry Yarotsky
We consider approximations of general continuous functions on finite-dimensional cubes by general deep ReLU neural networks and study the approximation rates with respect to the modulus of continuity of the function and the total number of weights $W$ in the network. We establish the complete phase diagram of feasible approximation rates and show that it includes two distinct phases. One phase corresponds to slower approximations that can be achieved with constant-depth networks and continuous weight assignments. The other phase provides faster approximations at the cost of depths necessarily growing as a power law $L\sim W^α, 0<α\le 1,$ and with necessarily discontinuous weight assignments. In particular, we prove that constant-width fully-connected networks of depth $L\sim W$ provide the fastest possible approximation rate $\|f-\widetilde f\|_\infty = O(ω_f(O(W^{-2/ν})))$ that cannot be achieved with less deep networks.
NEMay 3, 2017
Quantified advantage of discontinuous weight selection in approximations with deep neural networksDmitry Yarotsky
We consider approximations of 1D Lipschitz functions by deep ReLU networks of a fixed width. We prove that without the assumption of continuous weight selection the uniform approximation error is lower than with this assumption at least by a factor logarithmic in the size of the network.
CVJan 16, 2017
Geometric features for voxel-based surface recognitionDmitry Yarotsky
We introduce a library of geometric voxel features for CAD surface recognition/retrieval tasks. Our features include local versions of the intrinsic volumes (the usual 3D volume, surface area, integrated mean and Gaussian curvature) and a few closely related quantities. We also compute Haar wavelet and statistical distribution features by aggregating raw voxel features. We apply our features to object classification on the ESB data set and demonstrate accurate results with a small number of shallow decision trees.
LGOct 3, 2016
Error bounds for approximations with deep ReLU networksDmitry Yarotsky
We study expressive power of shallow and deep neural networks with piece-wise linear activation functions. We establish new rigorous upper and lower bounds for the network complexity in the setting of approximations in Sobolev spaces. In particular, we prove that deep ReLU networks more efficiently approximate smooth functions than shallow networks. In the case of approximations of 1D Lipschitz functions we describe adaptive depth-6 network architectures more efficient than the standard shallow architecture.
MSSep 5, 2016
GTApprox: surrogate modeling for industrial designMikhail Belyaev, Evgeny Burnaev, Ermek Kapushev et al.
We describe GTApprox - a new tool for medium-scale surrogate modeling in industrial design. Compared to existing software, GTApprox brings several innovations: a few novel approximation algorithms, several advanced methods of automated model selection, novel options in the form of hints. We demonstrate the efficiency of GTApprox on a large collection of test problems. In addition, we describe several applications of GTApprox to real engineering problems.
NAOct 21, 2012
Univariate interpolation by exponential functions and gaussian RBFs for generic sets of nodesDmitry Yarotsky
We consider interpolation of univariate functions on arbitrary sets of nodes by Gaussian radial basis functions or by exponential functions. We derive closed-form expressions for the interpolation error based on the Harish-Chandra-Itzykson-Zuber formula. We then prove the exponential convergence of interpolation for functions analytic in a sufficiently large domain. As an application, we prove the global exponential convergence of optimization by expected improvement for such functions.