Momentum Streams for Optimizer-Inspired Transformers

arXiv:2605.2442528.2

AI Analysis

For practitioners seeking improved Transformer training, this work provides a principled way to incorporate optimizer momentum into architecture design, with reported gains in validation loss and generalization.

The paper introduces optimizer-inspired Transformer architectures, with the triple-momentum TMMFormer achieving the lowest validation loss in pretraining, outperforming vanilla Transformers and prior variants. It identifies momentum, rather than preconditioning, as the main source of the gains, and reports that momentum-based designs reach flatter minima, with less forgetting and better generalization.

The residual update of a pre-norm Transformer layer admits an interpretation as one step of a first-order optimizer acting on a surrogate token energy, wherein the attention and MLP sublayers function as gradient oracles. Based on this observation, we build a family of optimizer-inspired Transformers (triple-momentum, Adam/AdamW, Muon, SOAP) and compare them under matched compute. In our main pretraining experiment, the triple-momentum TMMFormer achieves the lowest validation loss, outperforming the vanilla Transformer and prior architectural variants. A controlled ablation and supporting theory show that momentum, not preconditioning, is the main source of the gain. We further show that TMMFormer and other momentum-based designs reach flatter minima than the vanilla Transformer, which leads to less forgetting and better generalization.
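To make the optimizer reading of the residual stream concrete, here is a minimal sketch in plain Python. It is not the paper's TMMFormer: the abstract does not give the triple-momentum update, so the sketch substitutes a simple heavy-ball momentum stream, omits layer normalization, and uses a toy quadratic energy in place of attention and MLP sublayers. The names `vanilla_residual_stack`, `momentum_residual_stack`, `quadratic_oracle`, `beta`, and `eta` are all hypothetical and chosen for illustration only.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Sublayer = Callable[[Vector], Vector]


def quadratic_oracle(eta: float) -> Sublayer:
    """Toy stand-in for a sublayer (assumption, not from the paper).

    Returns -eta * grad E(x) for the energy E(x) = 0.5 * ||x||^2,
    i.e. one scaled negative-gradient direction.
    """
    def f(x: Vector) -> Vector:
        return [-eta * xi for xi in x]
    return f


def vanilla_residual_stack(x: Vector, sublayers: Sequence[Sublayer]) -> Vector:
    """Plain residual stream: x <- x + f(x), one gradient-like step per layer."""
    for f in sublayers:
        update = f(x)
        x = [xi + ui for xi, ui in zip(x, update)]
    return x


def momentum_residual_stack(
    x: Vector, sublayers: Sequence[Sublayer], beta: float
) -> Vector:
    """Heavy-ball residual stream (illustrative assumption).

    v <- beta * v + f(x);  x <- x + v.
    A second stream v carries an exponentially weighted sum of past
    sublayer outputs across depth. beta = 0 recovers the vanilla stack.
    """
    v = [0.0] * len(x)
    for f in sublayers:
        update = f(x)
        v = [beta * vi + ui for vi, ui in zip(v, update)]
        x = [xi + vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    layers = [quadratic_oracle(eta=0.1) for _ in range(4)]
    x0 = [1.0, -2.0]
    print("vanilla :", vanilla_residual_stack(x0, layers))
    print("momentum:", momentum_residual_stack(x0, layers, beta=0.5))
```

On this toy energy the momentum stream moves further toward the minimizer at the origin in the same number of layers, which is the kind of effect the optimizer analogy suggests. Whether and how that carries over to real attention and MLP sublayers is the subject of the paper's experiments, not something this sketch demonstrates.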
