LG NE NA ML
May 25, 2021

Scaling Properties of Deep Residual Networks

arXiv:2105.12245v2
21 citations
Originality: Incremental advance
AI Analysis

The paper challenges the validity of neural ODEs as an asymptotic description of deep ResNets and points to alternative differential equations. The result is significant for researchers in deep learning theory, though incremental in scope.

The authors investigate how the weights of deep residual networks (ResNets) trained with stochastic gradient descent scale with network depth. They find that the weights exhibit scaling regimes different from those assumed in neural ordinary differential equation (neural ODE) models. Depending on architectural features such as the smoothness of the activation function, the deep-network limit can be an alternative ODE, a stochastic differential equation, or neither.

Residual networks (ResNets) have displayed impressive results in pattern recognition and, recently, have garnered considerable theoretical interest due to a perceived link with neural ordinary differential equations (neural ODEs). This link relies on the convergence of network weights to a smooth function as the number of layers increases. We investigate the properties of weights trained by stochastic gradient descent and their scaling with network depth through detailed numerical experiments. We observe the existence of scaling regimes markedly different from those assumed in neural ODE literature. Depending on certain features of the network architecture, such as the smoothness of the activation function, one may obtain an alternative ODE limit, a stochastic differential equation or neither of these. These findings cast doubts on the validity of the neural ODE model as an adequate asymptotic description of deep ResNets and point to an alternative class of differential equations as a better description of the deep network limit.
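To make the distinction between scaling regimes concrete, here is a minimal sketch in Python. It is not the authors' experimental setup: the scalar recursion, the choice of cos as the smooth weight profile, and all function names are assumptions made for illustration. It contrasts weights sampled from a smooth function and scaled by 1/L (the kind of behaviour the neural ODE picture assumes) with i.i.d. Gaussian weights scaled by 1/sqrt(L) (a diffusive regime of the kind associated with a stochastic differential equation). The sum of squared per-layer weights vanishes with depth in the first case and stays of order one in the second.

```python
import math
import random


def smooth_weights(depth):
    # Neural-ODE-style assumption: w_k = a(k/L) / L for a smooth function a.
    # Here a = cos, a hypothetical choice made only for illustration.
    return [math.cos(k / depth) / depth for k in range(depth)]


def diffusive_weights(depth, rng):
    # Alternative regime: i.i.d. Gaussian weights of size 1/sqrt(L),
    # so the cumulative sum of weights behaves like a Brownian path.
    return [rng.gauss(0.0, 1.0) / math.sqrt(depth) for _ in range(depth)]


def quadratic_variation(weights):
    # Sum of squared per-layer weights. It tends to 0 as depth grows
    # in the smooth regime but not in the diffusive one.
    return sum(w * w for w in weights)


def forward(x, weights):
    # Simplified scalar residual recursion: h_{k+1} = h_k + w_k * tanh(h_k).
    h = x
    for w in weights:
        h = h + w * math.tanh(h)
    return h


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    for depth in (10, 100, 1000, 10000):
        qv_smooth = quadratic_variation(smooth_weights(depth))
        qv_diffusive = quadratic_variation(diffusive_weights(depth, rng))
        print(depth, qv_smooth, qv_diffusive)
```

Under these assumptions, the printed smooth-regime value is bounded by 1/L, while the diffusive-regime value fluctuates around 1 at every depth. A statistic of this kind is one simple way to tell the two regimes apart.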

Code Implementations: 1 repo
