LG, Oct 5, 2022

Dynamical Isometry for Residual Networks

arXiv:2210.02411v1, 2 citations, h-index: 16
AI Analysis

The paper addresses a practical problem for deep learning practitioners: it improves training stability and performance in residual networks, especially without Batch Normalization. It is, however, an incremental advance over existing initialization methods.

The paper tackles poor training and generalization in residual networks caused by suboptimal random parameter initialization. It proposes RISOTTO, an initialization scheme that achieves perfect dynamical isometry for residual networks with ReLU activations. In most of the reported experiments, RISOTTO outperforms other schemes such as Fixup and SkipInit, and it enables stable training.

The training success, training speed and generalization ability of neural networks rely crucially on the choice of random parameter initialization. It has been shown for multiple architectures that initial dynamical isometry is particularly advantageous. Known initialization schemes for residual blocks, however, lack this property: they suffer from degrading separability of different inputs with increasing depth, are unstable without Batch Normalization, or lack feature diversity. We propose a random initialization scheme, RISOTTO, that achieves perfect dynamical isometry for residual networks with ReLU activation functions even for finite depth and width. It balances the contributions of the residual and skip branches unlike other schemes, which initially bias towards the skip connections. In experiments, we demonstrate that in most cases our approach outperforms initialization schemes proposed to make Batch Normalization obsolete, including Fixup and SkipInit, and facilitates stable training. Also in combination with Batch Normalization, we find that RISOTTO often achieves the overall best result.
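To make the term concrete: dynamical isometry means that the input-output Jacobian of the network at initialization has all singular values equal (or close) to 1, so it preserves the norm of every input direction. The sketch below illustrates only this definition on a deep linear chain. It is not RISOTTO and does not reproduce the paper's construction. The helper names (`householder_apply`, `norm_ratio_range`), the use of one Householder reflection per layer as the orthogonal map, and the depth and width values are all assumptions chosen for illustration. Norm ratios on sampled vectors are used as a simple proxy for the singular value spectrum.

```python
import math
import random


def householder_apply(v, x):
    """Apply the reflection H = I - 2 v v^T / (v^T v) to x. H is orthogonal."""
    scale = 2.0 * sum(vi * xi for vi, xi in zip(v, x)) / sum(vi * vi for vi in v)
    return [xi - scale * vi for vi, xi in zip(v, x)]


def dense_apply(w, x):
    """Apply a dense matrix w (a list of rows) to the vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def norm(x):
    return math.sqrt(sum(xi * xi for xi in x))


def norm_ratio_range(depth, width, orthogonal, trials=20, seed=0):
    """Return (min, max) of ||J x|| / ||x|| over random x, where J is the
    product of `depth` linear layers (hypothetical toy setup, no ReLU)."""
    rng = random.Random(seed)
    if orthogonal:
        # One random Householder reflection per layer (illustrative choice).
        layers = [[rng.gauss(0, 1) for _ in range(width)] for _ in range(depth)]
        apply = householder_apply
    else:
        # Gaussian layers with variance 1/width: norms are preserved only on average.
        std = 1.0 / math.sqrt(width)
        layers = [
            [[rng.gauss(0, std) for _ in range(width)] for _ in range(width)]
            for _ in range(depth)
        ]
        apply = dense_apply
    ratios = []
    for _ in range(trials):
        x = [rng.gauss(0, 1) for _ in range(width)]
        h = x
        for layer in layers:
            h = apply(layer, h)
        ratios.append(norm(h) / norm(x))
    return min(ratios), max(ratios)


if __name__ == "__main__":
    print("orthogonal layers:", norm_ratio_range(20, 16, orthogonal=True))
    print("Gaussian layers:  ", norm_ratio_range(20, 16, orthogonal=False))
```

With orthogonal layers, the ratio stays at 1 up to floating-point error for every sampled input, which is the isometric case. With Gaussian layers the ratio varies from input to input, and the spread typically grows with depth. Achieving the isometric case with ReLU activations and skip connections is the harder problem that the abstract says RISOTTO solves.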
