LGJan 28, 2022
Interplay between depth of neural networks and locality of target functionsTakashi Mori, Masahito Ueda
It has been recognized that heavily overparameterized deep neural networks (DNNs) exhibit surprisingly good generalization performance in various machine-learning tasks. Although benefits of depth have been investigated from different perspectives such as the approximation theory and the statistical learning theory, existing theories do not adequately explain the empirical success of overparameterized DNNs. In this work, we report a remarkable interplay between depth and locality of a target function. We introduce $k$-local and $k$-global functions, and find that depth is beneficial for learning local functions but detrimental to learning global functions. This interplay is not properly captured by the neural tangent kernel, which describes an infinitely wide neural network within the lazy learning regime.
LGMay 20, 2021
Power-law escape rate of SGDTakashi Mori, Liu Ziyin, Kangqiao Liu et al.
Stochastic gradient descent (SGD) undergoes complicated multiplicative noise for the mean-square loss. We use this property of SGD noise to derive a stochastic differential equation (SDE) with simpler additive noise by performing a random time change. Using this formalism, we show that the log loss barrier $Δ\log L=\log[L(θ^s)/L(θ^*)]$ between a local minimum $θ^*$ and a saddle $θ^s$ determines the escape rate of SGD from the local minimum, contrary to the previous results borrowing from physics that the linear loss barrier $ΔL=L(θ^s)-L(θ^*)$ decides the escape rate. Our escape-rate formula strongly depends on the typical magnitude $h^*$ and the number $n$ of the outlier eigenvalues of the Hessian. This result explains an empirical fact that SGD prefers flat minima with low effective dimensions, giving an insight into implicit biases of SGD.
LGFeb 10, 2021
Strength of Minibatch Noise in SGDLiu Ziyin, Kangqiao Liu, Takashi Mori et al.
The noise in stochastic gradient descent (SGD), caused by minibatch sampling, is poorly understood despite its practical importance in deep learning. This work presents the first systematic study of the SGD noise and fluctuations close to a local minimum. We first analyze the SGD noise in linear regression in detail and then derive a general formula for approximating SGD noise in different types of minima. For application, our results (1) provide insight into the stability of training a neural network, (2) suggest that a large learning rate can help generalization by introducing an implicit regularization, (3) explain why the linear learning rate-batchsize scaling law fails at a large learning rate or at a small batchsize and (4) can provide an understanding of how discrete-time nature of SGD affects the recently discovered power-law phenomenon of SGD.
LGSep 28, 2020
Improved generalization by noise enhancementTakashi Mori, Masahito Ueda
Recent studies have demonstrated that noise in stochastic gradient descent (SGD) is closely related to generalization: A larger SGD noise, if not too large, results in better generalization. Since the covariance of the SGD noise is proportional to $η^2/B$, where $η$ is the learning rate and $B$ is the minibatch size of SGD, the SGD noise has so far been controlled by changing $η$ and/or $B$. However, too large $η$ results in instability in the training dynamics and a small $B$ prevents scalable parallel computation. It is thus desirable to develop a method of controlling the SGD noise without changing $η$ and $B$. In this paper, we propose a method that achieves this goal using ``noise enhancement'', which is easily implemented in practice. We expound the underlying theoretical idea and demonstrate that the noise enhancement actually improves generalization for real datasets. It turns out that large-batch training with the noise enhancement even shows better generalization compared with small-batch training.
LGMay 26, 2020
Is deeper better? It depends on locality of relevant featuresTakashi Mori, Masahito Ueda
It has been recognized that a heavily overparameterized artificial neural network exhibits surprisingly good generalization performance in various machine-learning tasks. Recent theoretical studies have made attempts to unveil the mystery of the overparameterization. In most of those previous works, the overparameterization is achieved by increasing the width of the network, while the effect of increasing the depth has remained less well understood. In this work, we investigate the effect of increasing the depth within an overparameterized regime. To gain an insight into the advantage of depth, we introduce local and global labels as abstract but simple classification rules. It turns out that the locality of the relevant feature for a given classification rule plays a key role; our experimental results suggest that deeper is better for local labels, whereas shallower is better for global labels. We also compare the results of finite networks with those of the neural tangent kernel (NTK), which is equivalent to an infinitely wide network with a proper initialization and an infinitesimal learning rate. It is shown that the NTK does not correctly capture the depth dependence of the generalization performance, which indicates the importance of the feature learning rather than the lazy learning.