OCLGPRMLJun 19, 2020

On the Almost Sure Convergence of Stochastic Gradient Descent in Non-Convex Problems

arXiv:2006.11144v1117 citations
Originality Highly original
AI Analysis

It provides theoretical guarantees for SGD in non-convex optimization, which is incremental but important for machine learning practitioners.

This paper proves that stochastic gradient descent (SGD) converges almost surely in non-convex problems, avoids strict saddle points with probability 1, and achieves a convergence rate of O(1/n^p) with specific step-size schedules, as demonstrated on ResNet architectures with CIFAR.

This paper analyzes the trajectories of stochastic gradient descent (SGD) to help understand the algorithm's convergence properties in non-convex problems. We first show that the sequence of iterates generated by SGD remains bounded and converges with probability $1$ under a very broad range of step-size schedules. Subsequently, going beyond existing positive probability guarantees, we show that SGD avoids strict saddle points/manifolds with probability $1$ for the entire spectrum of step-size policies considered. Finally, we prove that the algorithm's rate of convergence to Hurwicz minimizers is $\mathcal{O}(1/n^{p})$ if the method is employed with a $Θ(1/n^p)$ step-size schedule. This provides an important guideline for tuning the algorithm's step-size as it suggests that a cool-down phase with a vanishing step-size could lead to faster convergence; we demonstrate this heuristic using ResNet architectures on CIFAR.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes