LGCVJun 10, 2021

Beyond BatchNorm: Towards a Unified Understanding of Normalization in Deep Learning

arXiv:2106.05956v448 citations
Originality Incremental advance
AI Analysis

This work provides a foundational analysis for researchers in deep learning to systematically design and evaluate normalization layers, though it is incremental in extending existing insights.

The paper tackles the problem of understanding and predicting the success of normalization layers in deep neural networks by extending known properties of BatchNorm to other layers, finding that activations-based layers prevent exponential growth in ResNets, GroupNorm ensures informative forward propagation but suffers from slow convergence with large groups, and small groups cause training instability, revealing a unified set of mechanisms.

Inspired by BatchNorm, there has been an explosion of normalization layers in deep learning. Recent works have identified a multitude of beneficial properties in BatchNorm to explain its success. However, given the pursuit of alternative normalization layers, these properties need to be generalized so that any given layer's success/failure can be accurately predicted. In this work, we take a first step towards this goal by extending known properties of BatchNorm in randomly initialized deep neural networks (DNNs) to several recently proposed normalization layers. Our primary findings follow: (i) similar to BatchNorm, activations-based normalization layers can prevent exponential growth of activations in ResNets, but parametric techniques require explicit remedies; (ii) use of GroupNorm can ensure an informative forward propagation, with different samples being assigned dissimilar activations, but increasing group size results in increasingly indistinguishable activations for different samples, explaining slow convergence speed in models with LayerNorm; and (iii) small group sizes result in large gradient norm in earlier layers, hence explaining training instability issues in Instance Normalization and illustrating a speed-stability tradeoff in GroupNorm. Overall, our analysis reveals a unified set of mechanisms that underpin the success of normalization methods in deep learning, providing us with a compass to systematically explore the vast design space of DNN normalization layers.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes