Understanding the Generalization Benefit of Normalization Layers: Sharpness Reduction
This work addresses a foundational question in machine learning: why normalization layers improve generalization. Its contribution is incremental in the sense that it builds on the existing belief that flat minima generalize better.
The paper tackles the problem of understanding why normalization layers improve generalization in neural networks, showing through mathematical analysis and experiments that normalization, combined with weight decay, encourages gradient descent to reduce the sharpness of the loss surface, leading to better generalization.
Normalization layers (e.g., Batch Normalization, Layer Normalization) were introduced to help with optimization difficulties in very deep nets, but they clearly also help generalization, even in not-so-deep nets. Motivated by the long-held belief that flatter minima lead to better generalization, this paper gives mathematical analysis and supporting experiments suggesting that normalization (together with the accompanying weight decay) encourages GD to reduce the sharpness of the loss surface. Here "sharpness" is carefully defined given that the loss is scale-invariant, a known consequence of normalization. Specifically, for a fairly broad class of neural nets with normalization, our theory explains how GD with a finite learning rate enters the so-called Edge of Stability (EoS) regime, and characterizes the trajectory of GD in this regime via a continuous sharpness-reduction flow.
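
To make the remark about scale invariance concrete, the following is a minimal numerical sketch, not the paper's code. It uses a hypothetical toy loss that depends on the weights only through their direction, mimicking what normalization does to the preceding weights. The matrix `A`, the point `w`, the scale `c`, and all function names are illustrative assumptions. The sketch checks that rescaling the weights leaves the loss unchanged while the largest Hessian eigenvalue shrinks by a factor of c^2, which is why a naive sharpness measure is not meaningful here. Measuring sharpness at the normalized weights is one natural scale-invariant alternative; the paper's precise definition may differ in detail.

```python
import numpy as np

# Illustrative assumptions: a fixed symmetric matrix, a weight vector, a scale.
A = np.diag([1.0, 2.0, 5.0])
w = np.array([1.0, 0.5, -0.3])
c = 3.0

def loss(w):
    """Toy scale-invariant loss: depends on w only through w / ||w||."""
    u = w / np.linalg.norm(w)
    return 0.5 * u @ A @ u

def hessian_fd(f, w, eps=1e-4):
    """Central finite-difference approximation of the Hessian of f at w."""
    n = w.size
    basis = np.eye(n)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = eps * basis[i], eps * basis[j]
            H[i, j] = (f(w + ei + ej) - f(w + ei - ej)
                       - f(w - ei + ej) + f(w - ei - ej)) / (4 * eps**2)
    return 0.5 * (H + H.T)  # symmetrize to remove numerical asymmetry

def sharpness(f, w):
    """Naive sharpness: largest eigenvalue of the Hessian at w."""
    return np.linalg.eigvalsh(hessian_fd(f, w))[-1]

def spherical_sharpness(f, w):
    """Sharpness evaluated at the normalized weights (scale-invariant)."""
    return sharpness(f, w / np.linalg.norm(w))

if __name__ == "__main__":
    print("loss(w), loss(c*w):", loss(w), loss(c * w))
    print("naive sharpness at w, c*w:", sharpness(loss, w), sharpness(loss, c * w))
    print("spherical sharpness at w, c*w:",
          spherical_sharpness(loss, w), spherical_sharpness(loss, c * w))
```

The naive measure can be driven arbitrarily close to zero just by growing the weight norm, without changing the function the network computes. This is also where weight decay enters the picture: it controls the norm, and hence the effective step size on the sphere of weight directions.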