MLLGOCMay 21, 2018

On the Convergence of Stochastic Gradient Descent with Adaptive Stepsizes

arXiv:1805.08114v3341 citations
Originality Highly original
AI Analysis

This work addresses a foundational problem in machine learning optimization by providing theoretical guarantees for adaptive methods, which is incremental but important for improving training efficiency and reliability.

The paper tackles the theoretical gap in understanding adaptive stepsizes for stochastic gradient descent, particularly in non-convex settings, by analyzing a generalized AdaGrad version and proving almost sure asymptotic convergence of gradients to zero, with convergence rates interpolating between O(1/T) and O(1/√T) up to logarithmic terms.

Stochastic gradient descent is the method of choice for large scale optimization of machine learning objective functions. Yet, its performance is greatly variable and heavily depends on the choice of the stepsizes. This has motivated a large body of research on adaptive stepsizes. However, there is currently a gap in our theoretical understanding of these methods, especially in the non-convex setting. In this paper, we start closing this gap: we theoretically analyze in the convex and non-convex settings a generalized version of the AdaGrad stepsizes. We show sufficient conditions for these stepsizes to achieve almost sure asymptotic convergence of the gradients to zero, proving the first guarantee for generalized AdaGrad stepsizes in the non-convex setting. Moreover, we show that these stepsizes allow to automatically adapt to the level of noise of the stochastic gradients in both the convex and non-convex settings, interpolating between $O(1/T)$ and $O(1/\sqrt{T})$, up to logarithmic terms.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes