LG · AI · OC · Oct 10, 2025

Stability of Transformers under Layer Normalization

arXiv:2510.09904v1 · 5 citations · h-index: 32
Originality: Incremental advance
AI Analysis

This work addresses training stability issues for researchers and practitioners using Transformers, offering a principled framework for architectural design, though it is incremental as it builds on existing normalization methods.

The paper tackles training instability in deep Transformers by studying how layer normalization placement affects forward and backward stability. It derives explicit bounds on hidden-state growth and analyzes gradient backpropagation to guide residual-step scaling choices that improve stability and performance.

Despite their widespread use, training deep Transformers can be unstable. Layer normalization, a standard component, improves training stability, but its placement has often been ad-hoc. In this paper, we conduct a principled study on the forward (hidden states) and backward (gradient) stability of Transformers under different layer normalization placements. Our theory provides key insights into the training dynamics: whether training drives Transformers toward regular solutions or pathological behaviors. For forward stability, we derive explicit bounds on the growth of hidden states in trained Transformers. For backward stability, we analyze how layer normalization affects the backpropagation of gradients, thereby explaining the training dynamics of each layer normalization placement. Our analysis also guides the scaling of residual steps in Transformer blocks, where appropriate choices can further improve stability and performance. Our numerical results corroborate our theoretical findings. Beyond these results, our framework provides a principled way to sanity-check the stability of Transformers under new architectural modifications, offering guidance for future designs.
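To make the notions of "layer normalization placement" and "scaling of residual steps" concrete, here is a minimal NumPy sketch. It is not the paper's experiment or its bounds. The abstract does not name the placements it studies, so the two shown (Pre-LN and Post-LN) are assumed as the commonly compared variants. The single tanh layer standing in for attention/MLP, the residual step size `alpha`, and all function names are hypothetical choices made for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token to zero mean and unit variance (no learned gain/bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def sublayer(x, w):
    """Toy stand-in for attention or an MLP (assumption): one tanh layer."""
    return np.tanh(x @ w)

def pre_ln_block(x, w, alpha=1.0):
    # Pre-LN: normalize the sublayer input; the residual stream is not normalized.
    return x + alpha * sublayer(layer_norm(x), w)

def post_ln_block(x, w, alpha=1.0):
    # Post-LN: normalize after the residual addition.
    return layer_norm(x + alpha * sublayer(x, w))

def hidden_norms(block, x, weights, alpha):
    """Mean per-token L2 norm of the hidden state after each block."""
    norms = []
    for w in weights:
        x = block(x, w, alpha)
        norms.append(float(np.linalg.norm(x, axis=-1).mean()))
    return norms

rng = np.random.default_rng(0)
depth, tokens, dim = 32, 4, 64
x0 = rng.standard_normal((tokens, dim))
weights = [rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(depth)]

# alpha is a hypothetical residual step size; 1/sqrt(depth) is one illustrative choice.
pre = hidden_norms(pre_ln_block, x0, weights, alpha=1.0)
post = hidden_norms(post_ln_block, x0, weights, alpha=1.0)
pre_scaled = hidden_norms(pre_ln_block, x0, weights, alpha=1.0 / np.sqrt(depth))

print("final hidden-state norm, Pre-LN (alpha=1):        ", pre[-1])
print("final hidden-state norm, Post-LN (alpha=1):       ", post[-1])
print("final hidden-state norm, Pre-LN (alpha=1/sqrt(L)):", pre_scaled[-1])
```

In this toy setting the Post-LN hidden state is renormalized at every block, so its per-token norm stays near sqrt(dim), while the Pre-LN residual stream can accumulate up to alpha * sqrt(dim) of norm per block. A smaller residual step tightens that worst-case growth, which is the kind of forward-stability quantity the abstract refers to. The sketch covers only untrained random weights and says nothing about backward (gradient) stability.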
