LG CL CV Jul 23, 2025

DNT: a Deeply Normalized Transformer that can be trained by Momentum SGD

arXiv:2507.17501v1, 1 citation, h-index: 4
Originality: Incremental advance
AI Analysis

This work addresses a training bottleneck for researchers and practitioners using Transformers and offers a more stable and efficient optimization method. It is incremental, however, because it builds on existing normalization techniques.

The paper tackles the problem of training Transformers with momentum SGD instead of adaptive optimizers like AdamW by introducing a Deeply Normalized Transformer (DNT) that strategically integrates normalization to modulate gradients, enabling comparable performance to AdamW-trained Transformers.
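As a rough illustration of why normalization can modulate gradient scale, the sketch below applies a plain layer normalization (no learned gain or bias) to a vector and to a 1000x rescaled copy of it. The function name `layer_norm`, the `eps` value, and the choice of layer normalization are assumptions made for illustration; this summary does not say which normalization DNT uses or where it is placed.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and (approximately) unit variance.

    Hypothetical helper: learned gain/bias are omitted for brevity.
    """
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

x = [0.5, -1.0, 2.0, 0.25]

# Same direction, very different magnitude (e.g. after a large weight matrix).
small = layer_norm(x)
large = layer_norm([1000.0 * xi for xi in x])

# The two outputs agree up to the tiny effect of eps, so the input scale
# is not passed on to the next layer.
max_gap = max(abs(a - b) for a, b in zip(small, large))
print(max_gap)
```

Because the output does not grow with the input scale, the Jacobian of the normalization shrinks as the input grows, which is one generic way normalization keeps per-layer gradient magnitudes in a narrower range.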

Transformers have become the de facto backbone of modern deep learning, yet their training typically demands an advanced optimizer with an adaptive learning rate, such as AdamW, rather than momentum SGDW (mSGDW). Previous work attributes this mainly to the heavy-tailed distribution of the gradients. In this paper, we introduce a Deeply Normalized Transformer (DNT), which is meticulously engineered to overcome this limitation, enabling seamless training with vanilla mSGDW while yielding performance comparable to that of Transformers trained via AdamW. Specifically, in DNT we strategically integrate normalization techniques at proper positions in the Transformer to effectively modulate the Jacobian matrices of each layer, balance the influence of weights, activations, and their interactions, and thus concentrate the distribution of the gradients. We provide both theoretical justification for the normalization techniques used in DNT and an extensive empirical evaluation on two popular Transformer architectures to validate that: a) DNT outperforms its counterparts (i.e., ViT and GPT), and b) DNT can be effectively trained with vanilla mSGDW.
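For readers unfamiliar with the optimizer named in the abstract, here is a minimal sketch of one mSGDW update, assuming the "W" denotes decoupled weight decay in the same sense as in AdamW. The function name `msgdw_step`, the hyperparameter values, and the toy quadratic objective are all illustrative assumptions, not taken from the paper.

```python
def msgdw_step(params, grads, velocity, lr=0.1, momentum=0.9, weight_decay=0.01):
    """One momentum-SGD step with decoupled weight decay (assumed form).

    Unlike AdamW, every coordinate shares the same step size `lr`;
    there is no per-coordinate rescaling by gradient statistics.
    """
    new_params, new_velocity = [], []
    for p, g, v in zip(params, grads, velocity):
        v_next = momentum * v + g                         # momentum buffer
        p_next = p - lr * v_next - lr * weight_decay * p  # decay applied to the weight directly
        new_params.append(p_next)
        new_velocity.append(v_next)
    return new_params, new_velocity

# Toy objective f(w) = 0.5 * sum(w_i ** 2), whose gradient is w itself.
params = [1.0, -2.0]
velocity = [0.0, 0.0]
for _ in range(200):
    grads = list(params)
    params, velocity = msgdw_step(params, grads, velocity)

print(params)
```

Since the step size is identical for every coordinate, a few very large gradient entries can dominate the update. That is the usual intuition for why heavy-tailed gradients favor adaptive optimizers, and why concentrating the gradient distribution would make this simpler update viable.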

