LG · ML · Jun 14, 2022

Scaling ResNets in the Large-depth Regime

arXiv:2206.06929v3 · 21 citations · h-index: 71
Originality Highly original
AI Analysis

This work addresses a fundamental training stability issue for deep neural networks, particularly in large-depth regimes, with implications for researchers and practitioners in machine learning.

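The paper studies how to train deep ResNets without vanishing or exploding gradients by analyzing the layer scaling factor and the weight initialization. It shows that, with standard i.i.d. initializations, a scaling of 1/√L is the only choice that yields non-trivial dynamics, and that this scaling corresponds to a neural stochastic differential equation in the continuous-time limit. Experiments reveal a continuous range of regimes, driven by the scaling and by the regularity of the weights across layers, that affects performance.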
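A short heuristic computation may help show why the exponent 1/2 is the critical one. It uses a simplified linear residual model that is an assumption made here for illustration, not the paper's general setting: $h_{k+1} = h_k + \alpha_L W_{k+1} h_k$, where the $W_k \in \mathbb{R}^{d \times d}$ are independent across layers with i.i.d. $\mathcal{N}(0, 1/d)$ entries, and $\alpha_L = L^{-\beta}$.

```latex
\begin{align*}
\mathbb{E}\bigl[\|W_{k+1} h_k\|^2 \mid h_k\bigr] &= \|h_k\|^2,
\qquad
\mathbb{E}\bigl[\langle h_k, W_{k+1} h_k\rangle \mid h_k\bigr] = 0, \\
\mathbb{E}\|h_{k+1}\|^2 &= (1 + \alpha_L^2)\,\mathbb{E}\|h_k\|^2, \\
\mathbb{E}\|h_L - h_0\|^2 &= \bigl((1 + \alpha_L^2)^L - 1\bigr)\,\|h_0\|^2
\quad \text{(using } \mathbb{E}[h_L] = h_0\text{)}, \\
\lim_{L \to \infty} (1 + L^{-2\beta})^L &=
\begin{cases}
1 & \text{if } \beta > 1/2 \quad \text{(identity mapping)}, \\
e & \text{if } \beta = 1/2 \quad \text{(non-trivial limit)}, \\
\infty & \text{if } \beta < 1/2 \quad \text{(explosion)}.
\end{cases}
\end{align*}
```

In this toy model, only $\beta = 1/2$ keeps the output at a finite, non-zero distance from the input as the depth grows.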

Deep ResNets are recognized for achieving state-of-the-art results in complex machine learning tasks. However, the remarkable performance of these architectures relies on a training procedure that needs to be carefully crafted to avoid vanishing or exploding gradients, particularly as the depth $L$ increases. No consensus has been reached on how to mitigate this issue, although a widely discussed strategy consists in scaling the output of each layer by a factor $\alpha_L$. We show in a probabilistic setting that with standard i.i.d. initializations, the only non-trivial dynamics is for $\alpha_L = \frac{1}{\sqrt{L}}$; other choices lead either to explosion or to identity mapping. This scaling factor corresponds in the continuous-time limit to a neural stochastic differential equation, contrary to a widespread interpretation that deep ResNets are discretizations of neural ordinary differential equations. By contrast, in the latter regime, stability is obtained with specific correlated initializations and $\alpha_L = \frac{1}{L}$. Our analysis suggests a strong interplay between scaling and regularity of the weights as a function of the layer index. Finally, in a series of experiments, we exhibit a continuous range of regimes driven by these two parameters, which jointly impact performance before and after training.
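The three regimes described in the abstract can be observed numerically with a small simulation at initialization. The sketch below is not the authors' code. It assumes a toy residual update $h_{k+1} = h_k + L^{-\beta} W_{k+1}\,\mathrm{ReLU}(h_k)$ with i.i.d. $\mathcal{N}(0, 1/d)$ weights. The function names, width, depth, and number of seeds are hypothetical choices made for illustration.

```python
import math
import random


def relative_change(depth, beta, width=16, seed=0):
    """Return ||h_L - h_0|| / ||h_0|| for a toy ResNet at initialization.

    Assumed update rule (illustrative, not the paper's exact model):
        h_{k+1} = h_k + depth**(-beta) * W_{k+1} @ relu(h_k),
    with the entries of each W_k drawn i.i.d. from N(0, 1/width).
    """
    rng = random.Random(seed)
    alpha = depth ** (-beta)          # scaling factor alpha_L = L^(-beta)
    std = 1.0 / math.sqrt(width)      # standard deviation of each weight
    h0 = [rng.gauss(0.0, 1.0) for _ in range(width)]
    h = list(h0)
    for _ in range(depth):
        act = [max(x, 0.0) for x in h]  # ReLU activation
        # Matrix-vector product with fresh i.i.d. weights for this layer.
        update = [
            sum(rng.gauss(0.0, std) * a for a in act)
            for _ in range(width)
        ]
        h = [x + alpha * u for x, u in zip(h, update)]
    diff = math.sqrt(sum((x - y) ** 2 for x, y in zip(h, h0)))
    return diff / math.sqrt(sum(y * y for y in h0))


def mean_relative_change(depth, beta, n_seeds=5):
    """Average the relative change over a few random seeds."""
    total = sum(relative_change(depth, beta, seed=s) for s in range(n_seeds))
    return total / n_seeds


if __name__ == "__main__":
    for beta in (1.0, 0.5, 0.25):
        value = mean_relative_change(depth=256, beta=beta)
        print(f"beta = {beta}: mean relative change = {value:.3f}")
```

With these settings one would expect the relative change to be close to zero for `beta = 1.0` (near-identity mapping), of order one for `beta = 0.5`, and much larger for `beta = 0.25`. The exact values depend on the seeds and on the toy model chosen here. The correlated-initialization regime with $\alpha_L = \frac{1}{L}$ is not covered by this sketch, since it draws independent weights at every layer.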

Code Implementations: 1 repo
