LG, OC | Feb 5, 2024

Can We Remove the Square-Root in Adaptive Gradient Methods? A Second-Order Perspective

Wu Lin, Felix Dangel, Runa Eschenhagen, Juhan Bae, Richard E. Turner, Alireza Makhzani

U of Toronto

arXiv:2402.03496v1020.723 citations | h-index: 15 | Has Code | ICML

Originality: Incremental advance

AI Analysis

This work aims at more stable and efficient adaptive optimizers for deep learning. It offers practical benefits for half-precision training, though the contribution is incremental, since it modifies existing methods.

The paper investigates removing the square root from adaptive gradient optimizers such as Adam. It finds that square-root-free methods close the generalization gap to SGD on convolutional architectures while maintaining performance on transformers, and that they enable stable half-precision training without numerical issues.

Adaptive gradient optimizers like Adam(W) are the default training algorithms for many deep learning architectures, such as transformers. Their diagonal preconditioner is based on the gradient outer product which is incorporated into the parameter update via a square root. While these methods are often motivated as approximate second-order methods, the square root represents a fundamental difference. In this work, we investigate how the behavior of adaptive methods changes when we remove the root, i.e., strengthen their second-order motivation. Surprisingly, we find that such square-root-free adaptive methods close the generalization gap to SGD on convolutional architectures, while maintaining their root-based counterpart's performance on transformers. The second-order perspective also has practical benefits for developing non-diagonal methods that can incorporate arbitrary curvature approximations through the concept of preconditioner invariance. In contrast to root-based methods like Shampoo, root-free counterparts work well and fast with half-precision since they do not require numerically unstable matrix root decompositions and inversions. Overall, our findings provide new insights into the development of adaptive methods and raise important questions regarding the overlooked role of adaptivity in their success. (Experiment code: https://github.com/yorkerlin/remove-the-square-root; optimizer code: https://github.com/f-dangel/sirfshampoo)
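To make concrete where the square root enters a diagonal adaptive update, here is a minimal sketch in plain Python. It is not the authors' optimizer (see the linked repositories for that). It is a simplified RMSProp-style step in which the hypothetical flag `use_root` toggles the root in the denominator. The function name `adaptive_step`, its parameters, and the default values are assumptions made for illustration. The paper's root-free methods involve further changes that this sketch does not attempt to reproduce.

```python
import math


def adaptive_step(params, grads, second_moment, lr=0.01, beta2=0.9,
                  eps=1e-8, use_root=True):
    """One simplified diagonal adaptive update on lists of floats.

    Illustrative only: `second_moment` holds a running average of squared
    gradients and is updated in place. `use_root` is a hypothetical switch
    between a root-based and a root-free denominator.
    """
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        # Exponential moving average of the squared gradient (diagonal of
        # the gradient outer product).
        second_moment[i] = beta2 * second_moment[i] + (1.0 - beta2) * g * g
        if use_root:
            # Root-based: divide by the square root of the second moment.
            denom = math.sqrt(second_moment[i]) + eps
        else:
            # Root-free: divide by the second moment itself, as one would
            # with a curvature estimate in a second-order method.
            denom = second_moment[i] + eps
        new_params.append(p - lr * g / denom)
    return new_params


def step_size(grad_value, use_root):
    """Magnitude of a single step from a fresh state (beta2=0, eps=0)."""
    state = [0.0]
    new = adaptive_step([1.0], [grad_value], state, lr=0.1, beta2=0.0,
                        eps=0.0, use_root=use_root)
    return 1.0 - new[0]
```

Under these assumptions, calling `step_size(2.0, True)` and `step_size(20.0, True)` gives the same step length, because the root-based update is invariant to the gradient's scale. The root-free variant shrinks the step as the gradient grows, which is one way to see that the two updates behave differently rather than being interchangeable.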
