Where You Place the Norm Matters: From Prejudiced to Neutral Initializations

arXiv:2505.11312 · 16.11 citations · h-index: 44
Predicted impact: top 55% in LG · last 90 days · Originality: Highly original
AI Analysis

The paper offers principled guidance for more controlled and interpretable network design, addressing a foundational question in deep learning that matters to both researchers and practitioners.

The paper studies how normalization layers affect neural network behavior at initialization. It shows that choices such as BatchNorm vs. LayerNorm and Pre-Norm vs. Post-Norm can place a network's initial predictions anywhere between an unbiased (Neutral) regime and a highly concentrated (Prejudiced) regime, which in turn modulates learning dynamics.

Normalization layers were introduced to stabilize and accelerate training, yet their influence is critical already at initialization, where they shape signal propagation and output statistics before parameters adapt to data. In practice, both which normalization to use and where to place it are often chosen heuristically, despite the fact that these decisions can qualitatively alter a model's behavior. We provide a theoretical characterization of how normalization choice and placement (Pre-Norm vs. Post-Norm) determine the distribution of class predictions at initialization, ranging from unbiased (Neutral) to highly concentrated (Prejudiced) regimes. We show that these architectural decisions induce systematic shifts in the initial prediction regime, thereby modulating subsequent learning dynamics. By linking normalization design directly to prediction statistics at initialization, our results offer principled guidance for more controlled and interpretable network design, including clarifying how widely used choices such as BatchNorm vs. LayerNorm and Pre-Norm vs. Post-Norm shape behavior from the outset of training.
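One rough way to probe the prediction regime at initialization is to push random inputs through an untrained network and record the share of inputs assigned to the most frequent class. The NumPy sketch below does this for a plain ReLU MLP. It is an illustration under stated assumptions, not the authors' experimental setup. The function names, the He-style weight scale, the Gaussian inputs, and the `"before_act"` / `"after_act"` placement labels are all hypothetical choices made here, and the two placements are only a simplified stand-in that need not match the paper's Pre-Norm and Post-Norm definitions.

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    # Normalize each sample across its features (no learnable scale/shift).
    mu = h.mean(axis=1, keepdims=True)
    sd = h.std(axis=1, keepdims=True)
    return (h - mu) / (sd + eps)

def batch_norm(h, eps=1e-5):
    # Normalize each feature across the batch (no learnable scale/shift).
    mu = h.mean(axis=0, keepdims=True)
    sd = h.std(axis=0, keepdims=True)
    return (h - mu) / (sd + eps)

def max_class_fraction(norm=None, placement="before_act", depth=20, width=256,
                       n_classes=2, n_samples=2000, dim=100, seed=0):
    """Share of random inputs assigned to the most frequent class at init.

    A value near 1 / n_classes suggests balanced initial predictions;
    a value near 1 suggests predictions concentrated on one class.
    """
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((n_samples, dim))  # assumed Gaussian inputs
    for _ in range(depth):
        fan_in = h.shape[1]
        # Assumed He-style scaling for the random weights.
        W = rng.standard_normal((fan_in, width)) * np.sqrt(2.0 / fan_in)
        z = h @ W
        if norm is not None and placement == "before_act":
            h = np.maximum(norm(z), 0.0)   # normalize, then ReLU
        elif norm is not None and placement == "after_act":
            h = norm(np.maximum(z, 0.0))   # ReLU, then normalize
        else:
            h = np.maximum(z, 0.0)         # no normalization
    W_out = rng.standard_normal((h.shape[1], n_classes)) / np.sqrt(h.shape[1])
    preds = (h @ W_out).argmax(axis=1)
    counts = np.bincount(preds, minlength=n_classes)
    return counts.max() / n_samples

if __name__ == "__main__":
    configs = [
        ("no norm", None, "before_act"),
        ("LayerNorm, before activation", layer_norm, "before_act"),
        ("LayerNorm, after activation", layer_norm, "after_act"),
        ("BatchNorm, before activation", batch_norm, "before_act"),
        ("BatchNorm, after activation", batch_norm, "after_act"),
    ]
    for name, norm, placement in configs:
        frac = max_class_fraction(norm=norm, placement=placement)
        print(f"{name}: max class fraction = {frac:.3f}")
```

Because a single random draw of weights can be unrepresentative, averaging the statistic over several values of `seed` should give a steadier picture of how each configuration behaves.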

