Understanding Adam Requires Better Rotation Dependent Assumptions
This work addresses the lack of a theoretical explanation for Adam's advantages over SGD, a gap that matters to machine learning researchers and practitioners who rely on these optimizers to train deep learning models.
The paper investigates Adam's sensitivity to rotations of the parameter space, finding that its performance in training transformers degrades under random rotations, indicating that conventional rotation-invariant assumptions are insufficient to explain its advantages theoretically.
Despite its widespread adoption, Adam's advantage over Stochastic Gradient Descent (SGD) lacks a comprehensive theoretical explanation. This paper investigates Adam's sensitivity to rotations of the parameter space. We observe that Adam's performance in training transformers degrades under random rotations of the parameter space, indicating a crucial sensitivity to the choice of basis in practice. This reveals that conventional rotation-invariant assumptions are insufficient to capture Adam's advantages theoretically. To better understand the rotation-dependent properties that benefit Adam, we also identify structured rotations that preserve or even enhance its empirical performance. We then examine the rotation-dependent assumptions in the literature and find that they fall short in explaining Adam's behaviour across various rotation types. In contrast, we verify the orthogonality of the update as a promising indicator of Adam's basis sensitivity, suggesting it may be the key quantity for developing rotation-dependent theoretical frameworks that better explain its empirical success.
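To make the notion of rotation sensitivity concrete, the following is a minimal sketch on a toy quadratic loss, not the transformer experiments described above. The loss, the dimension, the step size, and the helper names (`random_rotation`, `sgd_step`, `adam_first_step`) are all illustrative assumptions. The sketch takes one optimizer step in the original basis and one in a randomly rotated basis, maps the rotated step back, and compares the two. For simplicity it uses only Adam's first step, where the bias-corrected moments reduce to the gradient and its elementwise square.

```python
import numpy as np

def random_rotation(dim, rng):
    # Random orthogonal matrix from a QR decomposition (illustrative helper).
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))  # flip column signs; q stays orthogonal

def sgd_step(grad, lr=0.1):
    # Plain gradient step: linear in the gradient.
    return -lr * grad

def adam_first_step(grad, lr=0.1, eps=1e-8):
    # First Adam step with bias correction: m_hat = grad, v_hat = grad**2,
    # so the update is an elementwise normalization of the gradient.
    return -lr * grad / (np.sqrt(grad**2) + eps)

rng = np.random.default_rng(0)
dim = 8
H = np.diag(np.logspace(0, 3, dim))  # toy ill-conditioned diagonal Hessian (assumption)
w = rng.standard_normal(dim)
R = random_rotation(dim, rng)

grad = H @ w           # gradient of f(w) = 0.5 * w^T H w
grad_rot = R.T @ grad  # gradient of g(v) = f(R v) at v = R^T w

# Take a step in the rotated basis, map it back with R, compare to the original step.
sgd_gap = np.linalg.norm(R @ sgd_step(grad_rot) - sgd_step(grad))
adam_gap = np.linalg.norm(R @ adam_first_step(grad_rot) - adam_first_step(grad))

print("SGD  step mismatch after rotation:", sgd_gap)
print("Adam step mismatch after rotation:", adam_gap)
```

In this sketch the SGD mismatch should be zero up to floating-point error, because the SGD step is linear in the gradient and therefore commutes with the rotation. The Adam step is normalized coordinate by coordinate, so the step taken in the rotated basis generally differs from the original one. This only illustrates basis dependence of the update rule; it says nothing about which basis gives better training performance.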