LG AI | Feb 26, 2024

Why Transformers Need Adam: A Hessian Perspective

Yushun Zhang, Congliang Chen, Tian Ding, Ziniu Li, Ruoyu Sun, Zhi-Quan Luo

arXiv:2402.16788v436.9130 citations | h-index: 15 | Has Code | NIPS

Originality: Incremental advance

AI Analysis

The paper addresses a key optimization question in Transformer training, but the advance is incremental because it builds on known Adam vs. SGD comparisons.

The paper explains why SGD underperforms Adam on Transformers by identifying "block heterogeneity": the Hessian spectrum differs sharply across parameter blocks. SGD struggles because it applies a single learning rate to all blocks, whereas Adam's coordinate-wise learning rates handle the heterogeneity better. Experiments confirm that SGD matches Adam only on problems without block heterogeneity.

SGD performs worse than Adam by a significant margin on Transformers, but the reason remains unclear. In this work, we provide an explanation through the lens of the Hessian: (i) Transformers are "heterogeneous": the Hessian spectrum varies dramatically across parameter blocks, a phenomenon we call "block heterogeneity"; (ii) Heterogeneity hampers SGD: SGD performs worse than Adam on problems with block heterogeneity. To validate (i) and (ii), we check various Transformers, CNNs, MLPs, and quadratic problems, and find that SGD can perform on par with Adam on problems without block heterogeneity, but performs worse than Adam when heterogeneity exists. Our initial theoretical analysis indicates that SGD performs worse because it applies a single learning rate to all blocks, which cannot handle the heterogeneity among blocks. This limitation could be ameliorated if we use coordinate-wise learning rates, as designed in Adam.
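The single-learning-rate argument can be illustrated on a toy quadratic. The sketch below is not the authors' code or experimental setup: it assumes a diagonal Hessian with two hypothetical blocks (`block_a`, `block_b`) whose eigenvalue ranges differ by a factor of 100, and it uses plain gradient descent with per-block step sizes as a simplified stand-in for Adam's coordinate-wise rates. All names and values are illustrative.

```python
import numpy as np

def run_gd(hessian_diag, lr, steps=100):
    """Gradient descent on f(w) = 0.5 * sum(h_i * w_i**2).

    lr may be a scalar (one rate for all coordinates) or an array
    with one rate per coordinate. Returns the final loss.
    """
    w = np.ones_like(hessian_diag)
    for _ in range(steps):
        grad = hessian_diag * w      # gradient of the diagonal quadratic
        w = w - lr * grad
    return 0.5 * np.sum(hessian_diag * w**2)

# Two hypothetical parameter blocks with very different curvature
# (assumed values, chosen only to create block heterogeneity).
block_a = np.linspace(1.0, 2.0, 5)       # flat block
block_b = np.linspace(100.0, 200.0, 5)   # sharp block
h = np.concatenate([block_a, block_b])

# One rate for everything: stability forces it down to the sharpest block.
single_lr = 1.0 / h.max()

# One rate per block: each block gets a step matched to its own curvature.
blockwise_lr = np.concatenate([
    np.full(block_a.size, 1.0 / block_a.max()),
    np.full(block_b.size, 1.0 / block_b.max()),
])

loss_single = run_gd(h, single_lr)
loss_blockwise = run_gd(h, blockwise_lr)
print(f"single lr:     {loss_single:.3e}")
print(f"block-wise lr: {loss_blockwise:.3e}")
```

With a single rate, the step size is capped by the sharp block, so the flat block contracts by at most a factor of about 0.99 per step and its loss remains visible after 100 steps. With per-block rates, both blocks contract by a factor of at most 0.5 per step. Adam itself adapts rates per coordinate from gradient statistics rather than from known curvature, so this sketch only conveys the intuition.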
