Normalization in Attention Dynamics
This work addresses how to choose a normalization scheme in transformers, a practical concern for machine learning practitioners, but its contribution is incremental because it builds on existing schemes.
The paper tackles the problem of understanding how normalization schemes affect token representations in deep transformers. It models the tokens as interacting particles on a sphere, shows that normalization regulates their speed, and identifies Peri-LN as an effective choice.
We study the effect of normalization schemes on token representations in deep transformers. Modeling their evolution as interacting particles on the sphere, we show that normalization acts as a form of speed regulation. This perspective enables a unified analysis of several schemes (including Post-LN, Pre-LN, Mix-LN, Peri-LN, and nGPT), revealing how they influence clustering dynamics and representation collapse. Our framework clarifies how different schemes shape token representations across layers and provides a principled basis for comparing them, identifying Peri-LN as a particularly effective choice.
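To make the particle picture concrete, the following is a minimal sketch, not the paper's model. It assumes identity query, key, and value matrices, a softmax attention drift projected onto the sphere's tangent space, and explicit Euler steps followed by renormalization. The names `simulate`, `speed`, `beta`, and `min_pairwise_cosine` are hypothetical. The constant scalar `speed` only stands in for the scheme-dependent speed factor, whose actual form for Post-LN, Pre-LN, Mix-LN, Peri-LN, or nGPT is not given in this text.

```python
import numpy as np


def tangent_project(x, v):
    """Project each row of v onto the tangent space of the unit sphere at the same row of x."""
    return v - np.sum(x * v, axis=1, keepdims=True) * x


def attention_drift(x, beta):
    """Softmax-weighted average of all particles, one row per particle.

    Assumption: identity query/key/value matrices, so logits are beta * <x_i, x_j>.
    """
    logits = beta * (x @ x.T)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x


def simulate(x0, beta=1.0, speed=1.0, dt=0.1, steps=200):
    """Evolve particles on the sphere with a constant, hypothetical speed factor."""
    x = x0 / np.linalg.norm(x0, axis=1, keepdims=True)
    for _ in range(steps):
        velocity = tangent_project(x, attention_drift(x, beta))
        x = x + dt * speed * velocity  # Euler step scaled by the speed factor
        x = x / np.linalg.norm(x, axis=1, keepdims=True)  # return to the sphere
    return x


def min_pairwise_cosine(x):
    """Smallest cosine similarity between any two particles (1.0 means full collapse)."""
    return float((x @ x.T).min())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(8, 3))
    for s in (0.25, 1.0, 4.0):
        print(s, min_pairwise_cosine(simulate(tokens, speed=s, steps=50)))
```

In this toy setting, a larger `speed` brings the particles closer together within the same number of steps, which is one way to read "speed regulation": the factor controls how quickly clustering proceeds across layers, under the stated assumptions.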