LG AI | Oct 25, 2024

Understanding Adam Requires Better Rotation Dependent Assumptions

Tianyue H. Zhang, Lucas Maes, Alan Milligan, Alexia Jolicoeur-Martineau, Ioannis Mitliagkas, Damien Scieur, Simon Lacoste-Julien, Charles Guille-Escuret

MILA

arXiv:2410.19964v317.011 citations | h-index: 14

Originality: Incremental advance

AI Analysis

This work addresses the lack of a theoretical explanation for Adam's advantages over SGD, a gap that matters to machine learning researchers and practitioners who select and tune optimizers for deep learning.

The paper investigates Adam's sensitivity to rotations of the parameter space, finding that its performance in training transformers degrades under random rotations, indicating that conventional rotation-invariant assumptions are insufficient to explain its advantages theoretically.

Despite its widespread adoption, Adam's advantage over Stochastic Gradient Descent (SGD) lacks a comprehensive theoretical explanation. This paper investigates Adam's sensitivity to rotations of the parameter space. We observe that Adam's performance in training transformers degrades under random rotations of the parameter space, indicating a crucial sensitivity to the choice of basis in practice. This reveals that conventional rotation-invariant assumptions are insufficient to capture Adam's advantages theoretically. To better understand the rotation-dependent properties that benefit Adam, we also identify structured rotations that preserve or even enhance its empirical performance. We then examine the rotation-dependent assumptions in the literature and find that they fall short in explaining Adam's behaviour across various rotation types. In contrast, we verify the orthogonality of the update as a promising indicator of Adam's basis sensitivity, suggesting it may be the key quantity for developing rotation-dependent theoretical frameworks that better explain its empirical success.
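
The basis sensitivity described above can be illustrated on a toy problem. The sketch below is not the paper's experimental setup (which trains transformers); it assumes a small diagonal quadratic loss, a random orthogonal change of basis, and full-batch gradients, so "SGD" here is plain gradient descent. All function names and hyperparameters are hypothetical choices made for illustration. It runs each optimizer in the original and in the rotated coordinates, maps the rotated iterate back, and measures how far the two results differ.

```python
import numpy as np


def random_orthogonal(dim, rng):
    # QR decomposition of a Gaussian matrix yields an orthogonal factor Q.
    # Multiplying columns by the signs of diag(R) removes the sign ambiguity.
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))


def run_gd(grad, x0, lr, steps):
    # Plain gradient descent (full-batch stand-in for SGD).
    x = x0.copy()
    for _ in range(steps):
        x = x - lr * grad(x)
    return x


def run_adam(grad, x0, lr, steps, b1=0.9, b2=0.999, eps=1e-8):
    # Standard Adam with bias correction; the update is elementwise,
    # which is what ties it to the chosen coordinate basis.
    x = x0.copy()
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for t in range(1, steps + 1):
        g = grad(x)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x


def compare(dim=10, steps=50, seed=0):
    rng = np.random.default_rng(seed)
    h_diag = np.logspace(0, 2, dim)  # assumed Hessian eigenvalues from 1 to 100
    rot = random_orthogonal(dim, rng)
    x0 = rng.standard_normal(dim)
    y0 = rot.T @ x0  # same starting point expressed in rotated coordinates

    # Loss f(x) = 0.5 * sum(h_diag * x**2); rotated loss g(y) = f(rot @ y).
    grad_orig = lambda x: h_diag * x
    grad_rot = lambda y: rot.T @ (h_diag * (rot @ y))

    gd_gap = np.linalg.norm(
        run_gd(grad_orig, x0, 1e-3, steps) - rot @ run_gd(grad_rot, y0, 1e-3, steps)
    )
    adam_gap = np.linalg.norm(
        run_adam(grad_orig, x0, 1e-2, steps) - rot @ run_adam(grad_rot, y0, 1e-2, steps)
    )
    return gd_gap, adam_gap


if __name__ == "__main__":
    gd_gap, adam_gap = compare()
    print(f"gradient descent gap between bases: {gd_gap:.3e}")
    print(f"Adam gap between bases:             {adam_gap:.3e}")
```

Under these assumptions, gradient descent follows the same trajectory in both bases up to floating-point error, while Adam's elementwise normalization makes its trajectory depend on the basis. The sketch only shows that the two Adam runs differ; it says nothing about which basis performs better.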
