LG NE | Dec 3, 2024

Batch Normalization Decomposed

Ido Nachum, Marco Bondaschi, Michael Gastpar, Anatoly Khina

arXiv:2412.02843v1 | 2.61 citations | h-index: 11

Originality: Incremental advance

AI Analysis

The paper provides theoretical insight into batch normalization, a widely used but poorly understood technique in deep learning. The advance is incremental, as it builds on prior work on deep linear networks.

The paper tackles the lack of understanding of batch normalization by analyzing its recentering and non-linearity components, revealing that at initialization, the representation converges to a single cluster with an outlier in an orthogonal direction.

Batch normalization is a successful building block of neural network architectures. Yet, it is not well understood. A neural network layer with batch normalization comprises three components that affect the representation induced by the network: recentering the mean of the representation to zero, rescaling the variance of the representation to one, and finally applying a non-linearity. Our work follows the work of Hadi Daneshmand, Amir Joudaki, Francis Bach [NeurIPS '21], which studied deep linear neural networks with only the rescaling stage between layers at initialization. In our work, we present an analysis of the other two key components of networks with batch normalization, namely, the recentering and the non-linearity. When these two components are present, we observe a curious behavior at initialization. Through the layers, the representation of the batch converges to a single cluster except for an odd data point that breaks far away from the cluster in an orthogonal direction. We shed light on this behavior from two perspectives: (1) we analyze the geometrical evolution of a simplified indicative model; (2) we prove a stability result for the aforementioned configuration.
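The three components named in the abstract can be illustrated with a short sketch. This is not the authors' code or experimental setup: the ReLU non-linearity, the Gaussian weight initialization, the batch size, width, and depth, and the function names `bn_layer`, `cosine_matrix`, and `run` are all assumptions chosen for illustration. The sketch pushes a random batch through randomly initialized layers, applying recentering, rescaling, and a non-linearity at each layer, and exposes pairwise cosine similarities so the geometry of the batch can be inspected at any depth.

```python
import math
import random


def bn_layer(batch, weights, recenter=True, nonlinearity=True, eps=1e-12):
    """Linear map, then per-feature batch normalization, then ReLU (assumed)."""
    n = len(batch)
    # Linear map: each row of `weights` produces one output feature.
    pre = [[sum(w * x for w, x in zip(row, point)) for row in weights]
           for point in batch]
    out = [[0.0] * len(weights) for _ in range(n)]
    for j in range(len(weights)):
        col = [pre[i][j] for i in range(n)]
        # Recentering: subtract the batch mean of this feature.
        mean = sum(col) / n if recenter else 0.0
        # Rescaling: divide by the root mean square deviation around `mean`.
        scale = math.sqrt(sum((v - mean) ** 2 for v in col) / n) + eps
        for i in range(n):
            z = (col[i] - mean) / scale
            # Non-linearity: ReLU is an assumption of this sketch.
            out[i][j] = max(z, 0.0) if nonlinearity else z
    return out


def cosine_matrix(batch, eps=1e-12):
    """Pairwise cosine similarities between the points of a batch."""
    norms = [max(math.sqrt(sum(v * v for v in p)), eps) for p in batch]
    return [[sum(a * b for a, b in zip(p, q)) / (norms[i] * norms[k])
             for k, q in enumerate(batch)]
            for i, p in enumerate(batch)]


def run(n=6, width=32, depth=20, seed=0):
    """Propagate a random batch through `depth` freshly initialized layers."""
    rng = random.Random(seed)
    batch = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(n)]
    for _ in range(depth):
        weights = [[rng.gauss(0.0, 1.0 / math.sqrt(width)) for _ in range(width)]
                   for _ in range(width)]
        batch = bn_layer(batch, weights)
    return batch


if __name__ == "__main__":
    # Print the similarity matrix of the final-layer representation.
    for row in cosine_matrix(run()):
        print(" ".join(f"{c:5.2f}" for c in row))
```

Setting `recenter=False` and `nonlinearity=False` in `bn_layer` gives a rescaling-only linear layer, loosely in the spirit of the prior setting the abstract refers to. Whether the similarity matrix shows the cluster-plus-outlier pattern described above will depend on the assumed sizes and seed.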
