LG | NE | Dec 3, 2024

Batch Normalization Decomposed

arXiv:2412.02843v1 | 1 citation | h-index: 11
Originality: Incremental advance
AI Analysis

The paper provides theoretical insight into batch normalization, a widely used but poorly understood technique in deep learning. The advance is incremental because it builds on prior work on deep linear networks.

The paper addresses the limited understanding of batch normalization by analyzing two of its components, the recentering and the non-linearity. It reveals that at initialization, as depth increases, the representation of the batch converges to a single cluster, except for one outlier data point that moves far from the cluster in an orthogonal direction.

Batch normalization is a successful building block of neural network architectures. Yet, it is not well understood. A neural network layer with batch normalization comprises three components that affect the representation induced by the network: recentering the mean of the representation to zero, rescaling the variance of the representation to one, and finally applying a non-linearity. Our work follows the work of Hadi Daneshmand, Amir Joudaki, Francis Bach [NeurIPS '21], which studied deep linear neural networks with only the rescaling stage between layers at initialization. In our work, we present an analysis of the other two key components of networks with batch normalization, namely, the recentering and the non-linearity. When these two components are present, we observe a curious behavior at initialization. Through the layers, the representation of the batch converges to a single cluster except for an odd data point that breaks far away from the cluster in an orthogonal direction. We shed light on this behavior from two perspectives: (1) we analyze the geometrical evolution of a simplified indicative model; (2) we prove a stability result for the aforementioned configuration.
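To make the three components named in the abstract concrete, the following is a minimal sketch of one batch-normalized layer at initialization, written in plain Python. Several choices here are assumptions for illustration and are not taken from the paper: the function names, the Gaussian weight initialization with scale 1/sqrt(d), the choice of ReLU as the non-linearity, and the placement of the linear map before the normalization. The sketch only shows the mechanics of recentering, rescaling, and the non-linearity; it does not claim to reproduce the paper's experiments or results.

```python
import math
import random


def recenter(X):
    """Recentering: subtract the batch mean from every feature (column)."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in X]


def rescale(X, eps=1e-12):
    """Rescaling: divide every feature by its batch standard deviation,
    so that the batch variance of each feature is (approximately) one."""
    n, d = len(X), len(X[0])
    stds = []
    for j in range(d):
        col = [row[j] for row in X]
        mu = sum(col) / n
        var = sum((v - mu) ** 2 for v in col) / n
        stds.append(math.sqrt(var + eps))  # eps guards against division by zero
    return [[row[j] / stds[j] for j in range(d)] for row in X]


def relu(v):
    """Assumed non-linearity; the abstract does not name a specific one."""
    return v if v > 0.0 else 0.0


def random_weights(d_in, d_out, rng):
    """Assumed initialization: i.i.d. Gaussian entries with std 1/sqrt(d_in)."""
    scale = 1.0 / math.sqrt(d_in)
    return [[rng.gauss(0.0, scale) for _ in range(d_out)] for _ in range(d_in)]


def linear(X, W):
    """Multiply the batch X (n x d_in) by the weight matrix W (d_in x d_out)."""
    d_out = len(W[0])
    return [
        [sum(x_k * W[k][j] for k, x_k in enumerate(row)) for j in range(d_out)]
        for row in X
    ]


def bn_layer(X, W, nonlinearity=relu):
    """One layer: linear map, then recenter, rescale, and apply the non-linearity."""
    Z = rescale(recenter(linear(X, W)))
    return [[nonlinearity(v) for v in row] for row in Z]


# Pass a small random batch through a stack of randomly initialized layers.
rng = random.Random(0)
n, d, depth = 8, 16, 10  # batch size, width, number of layers (arbitrary choices)
H = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
for _ in range(depth):
    H = bn_layer(H, random_weights(d, d, rng))
print(len(H), len(H[0]))
```

Dropping `recenter` and replacing `relu` with the identity gives the rescaling-only linear setting that the abstract attributes to the earlier work, which makes the sketch a convenient place to toggle each component separately.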

