LG · AI · Jun 3, 2024

On the Nonlinearity of Layer Normalization

arXiv:2406.01255v1 · 10 citations
AI Analysis

This work addresses a theoretical gap in understanding LN, which is widely used in deep learning, by analyzing its nonlinear properties and potential for neural architecture design.

The paper investigates the nonlinearity and representation capacity of layer normalization (LN), showing that an LN-Net with only 3 neurons per layer and O(m) layers can correctly classify m samples with any label assignment, and provides a lower bound for its VC dimension.

Layer normalization (LN) is a ubiquitous technique in deep learning, but our theoretical understanding of it remains elusive. This paper investigates a new theoretical direction for LN, regarding its nonlinearity and representation capacity. We investigate the representation capacity of a network built by layerwise composition of linear and LN transformations, referred to as LN-Net. We theoretically show that, given $m$ samples with any label assignment, an LN-Net with only 3 neurons in each layer and $O(m)$ LN layers can correctly classify them. We further show a lower bound on the VC dimension of an LN-Net. The nonlinearity of LN can be amplified by group partition, which is also theoretically demonstrated under mild assumptions and empirically supported by our experiments. Based on our analyses, we consider designing neural architectures by exploiting and amplifying the nonlinearity of LN, and the effectiveness is supported by our experiments.
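
To make the abstract's central claim concrete, the following is a minimal sketch in plain Python of LN as a nonlinear map. It is not the paper's code. It assumes LN without learnable scale and shift, a small `eps` for numerical stability, and equal-sized groups for the group partition; the names `layer_norm`, `group_layer_norm`, `ln_net`, and `additivity_gap` are hypothetical and chosen only for illustration.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (roughly) unit variance.

    Assumption: no learnable scale/shift; eps is only for stability.
    """
    d = len(x)
    mean = sum(x) / d
    var = sum((v - mean) ** 2 for v in x) / d
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def group_layer_norm(x, num_groups, eps=1e-5):
    """Apply layer_norm independently to num_groups equal-sized chunks of x."""
    d = len(x)
    if d % num_groups != 0:
        raise ValueError("len(x) must be divisible by num_groups")
    size = d // num_groups
    out = []
    for g in range(num_groups):
        out.extend(layer_norm(x[g * size:(g + 1) * size], eps))
    return out


def ln_net(x, weights):
    """Alternate a linear map (matrix given as a list of rows) with LN."""
    for W in weights:
        x = layer_norm([sum(w * v for w, v in zip(row, x)) for row in W])
    return x


def additivity_gap(f, a, b):
    """Largest coordinate of |f(a + b) - (f(a) + f(b))|; zero for a linear f."""
    s = [p + q for p, q in zip(a, b)]
    return max(abs(u - (v + w)) for u, v, w in zip(f(s), f(a), f(b)))


a = [1.0, 2.0, 4.0]
b = [3.0, -1.0, 0.5]

# A linear map would give a gap of zero; LN does not.
gap_ln = additivity_gap(layer_norm, a, b)

# LN is (almost) invariant to rescaling its input, whereas a linear map
# would double its output when the input is doubled.
scale_gap = max(
    abs(u - v) for u, v in zip(layer_norm([2 * v for v in a]), layer_norm(a))
)

# Group partition: two groups of three coordinates, each normalized separately.
grouped = group_layer_norm(a + b, num_groups=2)

# A tiny LN-Net with 3 neurons per layer and two layers (identity weights).
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
out = ln_net(a, [identity, identity])
```

The nonzero `gap_ln` and the near-zero `scale_gap` are two simple symptoms of nonlinearity. They illustrate why stacking linear maps with LN is not equivalent to a single linear map, but they do not reproduce the paper's classification or VC-dimension results.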

