LG · ML · Nov 5, 2024

ADOPT: Modified Adam Can Converge with Any $β_2$ with the Optimal Rate

arXiv:2411.02853v3 · 22 citations · h-index: 20 · Has Code · NIPS
AI Analysis

This paper addresses a foundational problem in deep learning optimization by offering a theoretically sound and practical alternative to Adam, though it is an incremental improvement over existing variants.

The paper tackles the theoretical non-convergence of the Adam optimizer by proposing ADOPT, a modified version that achieves the optimal convergence rate of O(1/√T) for any choice of β₂ without relying on the impractical bounded-noise assumption. It also reports superior performance in experiments across tasks such as image classification and NLP.

Adam is one of the most popular optimization algorithms in deep learning. However, it is known that Adam does not converge in theory unless the hyperparameter $β_2$ is chosen in a problem-dependent manner. There have been many attempts to fix the non-convergence (e.g., AMSGrad), but they require the impractical assumption that the gradient noise is uniformly bounded. In this paper, we propose a new adaptive gradient method named ADOPT, which achieves the optimal convergence rate of $\mathcal{O} ( 1 / \sqrt{T} )$ with any choice of $β_2$ without depending on the bounded noise assumption. ADOPT addresses the non-convergence issue of Adam by removing the current gradient from the second moment estimate and changing the order of the momentum update and the normalization by the second moment estimate. We also conduct intensive numerical experiments and verify that ADOPT achieves superior results compared to Adam and its variants across a wide range of tasks, including image classification, generative modeling, natural language processing, and deep reinforcement learning. The implementation is available at https://github.com/iShohei220/adopt.
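The two changes described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation (see the repository linked above for that). The function name `adopt_minimize`, the initialization of the second moment from a first gradient, the placement of `eps` inside a `max`, and the default hyperparameter values are all assumptions made for illustration.

```python
import math


def adopt_minimize(grad_fn, theta0, lr=0.1, beta1=0.9, beta2=0.9999,
                   eps=1e-6, steps=1000):
    """Sketch of an ADOPT-style update on a list of floats.

    grad_fn (hypothetical helper): maps a parameter list to a gradient list.
    """
    theta = list(theta0)

    # Assumption: seed the second-moment estimate with a first gradient.
    g = grad_fn(theta)
    v = [gi * gi for gi in g]
    m = [0.0] * len(theta)

    for _ in range(steps):
        g = grad_fn(theta)

        # Change 1: normalize with the PREVIOUS second-moment estimate,
        # so the current gradient is not part of its own normalizer.
        normed = [gi / max(math.sqrt(vi), eps) for gi, vi in zip(g, v)]

        # Change 2: momentum is applied AFTER normalization
        # (Adam instead normalizes the momentum term).
        m = [beta1 * mi + (1.0 - beta1) * ni for mi, ni in zip(m, normed)]

        # Parameter step uses the momentum directly.
        theta = [ti - lr * mi for ti, mi in zip(theta, m)]

        # Only now fold the current gradient into the second moment.
        v = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, g)]

    return theta


if __name__ == "__main__":
    # Toy problem: minimize f(x) = (x - 3)^2, whose gradient is 2 (x - 3).
    result = adopt_minimize(lambda th: [2.0 * (th[0] - 3.0)], [0.0])
    print(result)
```

The ordering inside the loop is the point of the sketch: the normalizer `v` is read before it is updated with the current gradient, and the momentum accumulates already-normalized gradients. The toy quadratic is deterministic, so it only demonstrates the mechanics, not the stochastic convergence guarantee the paper proves.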

Code Implementations: 2 repos