OC LG ML, Nov 3, 2023

High Probability Convergence of Adam Under Unbounded Gradients and Affine Variance Noise

arXiv:2311.02000v1, 11 citations, h-index: 4
Originality: Highly original
AI Analysis

This work addresses a foundational problem in machine learning by improving the theoretical understanding of Adam, making it more applicable to real-world scenarios, though it is incremental as it builds on prior convergence analyses.

The paper tackles the theoretical convergence of the Adam algorithm for non-convex stochastic optimization by providing a high-probability convergence rate of O(poly(log T)/√T) under affine variance noise, without requiring bounded gradients or prior problem-dependent knowledge, and shows that gradient magnitudes are bounded by O(poly(log T)).

In this paper, we study the convergence of the Adaptive Moment Estimation (Adam) algorithm for unconstrained non-convex smooth stochastic optimization. Despite Adam's widespread use in machine learning, theoretical understanding of it remains limited. Prior research has primarily analyzed Adam's convergence in expectation, often under strong assumptions such as uniformly bounded stochastic gradients or prior knowledge of problem-dependent parameters. As a result, the applicability of these findings to practical scenarios has been limited. To overcome these limitations, we provide a detailed analysis and show that Adam converges to a stationary point with high probability at a rate of $\mathcal{O}\left({\rm poly}(\log T)/\sqrt{T}\right)$ under coordinate-wise "affine" variance noise, without requiring any bounded-gradient assumption or any prior problem-dependent knowledge to tune hyper-parameters. Additionally, we show that Adam confines the magnitudes of its gradients within an order of $\mathcal{O}\left({\rm poly}(\log T)\right)$. Finally, we also investigate a simplified version of Adam without one of the corrective terms and obtain a convergence rate that is adaptive to the noise level.
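To make the setting concrete, the following is a minimal sketch of Adam driven by a stochastic gradient whose noise satisfies a coordinate-wise affine bound of the form $|g_i - \nabla f(x)_i|^2 \le \sigma_0^2 + \sigma_1^2 |\nabla f(x)_i|^2$. This is a common way to state affine variance noise, and the paper's exact assumption may differ. The update is the textbook bias-corrected Adam, not necessarily the exact variant analyzed in the paper. The names `affine_noise_grad`, `example_grad`, and `adam`, the bounded uniform noise model, the test objective, and all hyper-parameter defaults are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def affine_noise_grad(grad, sigma0, sigma1, rng):
    """Return a noisy gradient with coordinate-wise affine-bounded noise.

    Each coordinate satisfies
        (g_i - grad_i)^2 <= sigma0^2 + sigma1^2 * grad_i^2,
    so the noise may grow with the gradient and is not uniformly bounded.
    Uniform noise is an illustrative choice, not the paper's model.
    """
    scale = np.sqrt(sigma0**2 + sigma1**2 * grad**2)
    return grad + scale * rng.uniform(-1.0, 1.0, size=grad.shape)


def example_grad(x):
    """Gradient of f(x) = sum(x_i^2 + 3 sin^2(x_i)).

    f is smooth and non-convex, and its gradient is unbounded.
    (Hypothetical test objective, chosen only for illustration.)
    """
    return 2.0 * x + 3.0 * np.sin(2.0 * x)


def adam(grad_fn, x0, steps, lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8,
         sigma0=0.1, sigma1=0.5, seed=0):
    """Run textbook Adam with noisy gradients; return final x and the
    norms of the true (noise-free) gradient at every iterate."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)  # first-moment (momentum) estimate
    v = np.zeros_like(x)  # second-moment estimate
    grad_norms = []
    for t in range(1, steps + 1):
        true_grad = grad_fn(x)
        grad_norms.append(np.linalg.norm(true_grad))
        g = affine_noise_grad(true_grad, sigma0, sigma1, rng)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g**2
        m_hat = m / (1.0 - beta1**t)  # bias correction for m
        v_hat = v / (1.0 - beta2**t)  # bias correction for v
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x, np.array(grad_norms)
```

Calling `adam(example_grad, [3.0, -2.0], steps=3000)` returns the final iterate together with the recorded true-gradient norms. Inspecting how those norms shrink over the run gives an informal, single-run picture of the stationarity the abstract describes. It does not verify the high-probability rate.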

