On skip connections and normalisation layers in deep optimisation
This work explains how two ubiquitous architectural choices, normalisation layers and skip connections, improve the training efficiency and convergence of deep neural networks under gradient optimisation.
The authors introduce a theoretical framework for analysing gradient optimisation in deep neural networks, focusing on how normalisation layers and skip connections affect the curvature and regularity of the loss landscape. They prove that a class of deep networks can be trained by gradient descent to global optima even when these optima exist only at infinity, as with the cross-entropy cost, and they identify a causal mechanism by which skip connections accelerate training, verified with ResNets on MNIST, CIFAR10, CIFAR100 and ImageNet.
We introduce a general theoretical framework, designed for the study of gradient optimisation of deep neural networks, that encompasses ubiquitous architecture choices including batch normalisation, weight normalisation and skip connections. Our framework determines the curvature and regularity properties of multilayer loss landscapes in terms of their constituent layers, thereby elucidating the roles played by normalisation layers and skip connections in globalising these properties. We then demonstrate the utility of this framework in two respects. First, we give the only proof of which we are aware that a class of deep neural networks can be trained using gradient descent to global optima even when such optima only exist at infinity, as is the case for the cross-entropy cost. Second, we identify a novel causal mechanism by which skip connections accelerate training, which we verify predictively with ResNets on MNIST, CIFAR10, CIFAR100 and ImageNet.
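To make concrete what it means for an optimum to exist only at infinity, the following minimal sketch runs gradient descent on a one-parameter logistic model with the cross-entropy cost. It is a toy illustration and not the class of networks treated in the paper: the data set, the function names (`loss`, `grad`, `run_gradient_descent`), the step size and the number of steps are all assumptions chosen for the example. Because the data are linearly separable, the cost is strictly decreasing in the weight `w`, so its infimum of zero is approached only as `w` grows without bound.

```python
import math

# Hypothetical toy data, linearly separable: the label is 1 exactly when x > 0.
XS = [1.0, 2.0, -1.0, -2.0]
YS = [1, 1, 0, 0]


def sigmoid(z):
    """Logistic function, written via tanh so that it cannot overflow."""
    return 0.5 * (1.0 + math.tanh(0.5 * z))


def loss(w):
    """Mean binary cross-entropy of the model p(y=1 | x) = sigmoid(w * x)."""
    total = 0.0
    for x, y in zip(XS, YS):
        margin = w * x if y == 1 else -w * x
        # log(1 + exp(-margin)) in a numerically stable form
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(XS)


def grad(w):
    """Derivative of the mean cross-entropy with respect to w."""
    return sum((sigmoid(w * x) - y) * x for x, y in zip(XS, YS)) / len(XS)


def run_gradient_descent(steps=2000, lr=1.0, w0=0.0):
    """Return the list of (w, loss(w)) pairs visited by plain gradient descent."""
    w = w0
    history = [(w, loss(w))]
    for _ in range(steps):
        w -= lr * grad(w)
        history.append((w, loss(w)))
    return history


if __name__ == "__main__":
    history = run_gradient_descent()
    for step in (0, 10, 100, 1000, 2000):
        w, value = history[step]
        print(f"step {step:5d}  w = {w:8.4f}  loss = {value:.6f}")
```

In this toy problem the gradient is negative for every finite `w`, so each step increases `w` and lowers the cost, yet the cost stays strictly positive at every iterate. No finite minimiser exists, which is the situation the abstract describes for the cross-entropy cost.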