Norm matters: efficient and accurate normalization schemes in deep networks
This work addresses the need for more efficient and stable normalization techniques in deep learning, particularly for low-precision and large-scale applications, though it is incremental as it builds on existing normalization methods.
The paper tackles the problem of understanding and improving normalization methods in deep networks by proposing novel normalization schemes in L1 and L∞ spaces, which enhance numerical stability in low-precision implementations and offer computational benefits, enabling the first batch-norm alternative for half-precision and improving weight-normalization on large-scale tasks.
Over the past few years, Batch-Normalization has been commonly used in deep networks, allowing faster training and high performance for a wide variety of applications. However, the reasons behind its merits remained unanswered, with several shortcomings that hindered its use for certain tasks. In this work, we present a novel view on the purpose and function of normalization methods and weight-decay, as tools to decouple weights' norm from the underlying optimized objective. This property highlights the connection between practices such as normalization, weight decay and learning-rate adjustments. We suggest several alternatives to the widely used $L^2$ batch-norm, using normalization in $L^1$ and $L^\infty$ spaces that can substantially improve numerical stability in low-precision implementations as well as provide computational and memory benefits. We demonstrate that such methods enable the first batch-norm alternative to work for half-precision implementations. Finally, we suggest a modification to weight-normalization, which improves its performance on large-scale tasks.