Momentum-Based Variance Reduction in Non-Convex SGD
STORM addresses the implementation complexity and hyperparameter tuning that variance-reduction methods impose on machine learning practitioners in non-convex optimization, offering a simpler method that needs no large batches.
The paper tackles variance reduction in non-convex stochastic gradient descent by introducing STORM, an algorithm that eliminates the need for large batches and relies on adaptive learning rates. It achieves the optimal convergence rate of O(1/√T + σ^{1/3}/T^{1/3}) without requiring knowledge of the gradient variance σ².
Variance reduction has emerged in recent years as a strong competitor to stochastic gradient descent in non-convex problems, providing the first algorithms to improve upon the convergence rate of stochastic gradient descent for finding first-order critical points. However, variance reduction techniques typically require carefully tuned learning rates and a willingness to use excessively large "mega-batches" in order to achieve their improved results. We present a new algorithm, STORM, that does not require any batches and makes use of adaptive learning rates, enabling simpler implementation and less hyperparameter tuning. Our technique for removing the batches uses a variant of momentum to achieve variance reduction in non-convex optimization. On smooth losses $F$, STORM finds a point $\boldsymbol{x}$ with $\mathbb{E}[\|\nabla F(\boldsymbol{x})\|]\le O(1/\sqrt{T}+\sigma^{1/3}/T^{1/3})$ in $T$ iterations with $\sigma^2$ variance in the gradients, matching the optimal rate but without requiring knowledge of $\sigma$.
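To make the "variant of momentum" concrete, the following is a minimal sketch of a momentum-based variance-reduced update in the spirit of STORM. The abstract does not spell out the update rule, so everything here is an assumption for illustration: the recursive estimator d_t = g(x_t, ξ_t) + (1 − a_t)(d_{t−1} − g(x_{t−1}, ξ_t)), the step size η_t = k / (w + Σ‖g_i‖²)^{1/3}, the momentum weight a = c·η², the constants `k`, `w`, `c`, the clipping of `a` to at most 1, and the names `storm_sketch` and `noisy_quadratic_grad`. The key point it shows is that each step uses a single sample, evaluated at both the current and the previous iterate, rather than a large batch.

```python
import numpy as np


def noisy_quadratic_grad(x, xi):
    # Hypothetical test objective F(x) = 0.5 * ||x||^2 with additive noise xi.
    return x + 0.1 * xi


def storm_sketch(grad_fn, x0, steps, k=0.1, w=0.1, c=100.0, seed=0):
    """Sketch of a STORM-style loop (assumed form, not taken from the abstract).

    grad_fn(x, xi) must return a stochastic gradient at x for the sample xi.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)

    # Initialise the estimator with a single stochastic gradient.
    xi = rng.standard_normal(x.shape)
    g = grad_fn(x, xi)
    d = g.copy()
    sum_sq = float(np.dot(g, g))  # running sum of squared gradient norms

    for _ in range(steps):
        # Adaptive step size: shrinks as observed gradient norms accumulate.
        eta = k / (w + sum_sq) ** (1.0 / 3.0)
        x_prev, x = x, x - eta * d

        # Momentum weight tied to the step size; clipping to 1 is an added safeguard.
        a = min(1.0, c * eta ** 2)

        # One fresh sample, evaluated at both the new and the previous iterate.
        xi = rng.standard_normal(x.shape)
        g_new = grad_fn(x, xi)
        g_old = grad_fn(x_prev, xi)

        # Recursive estimator: new gradient plus a corrected carry-over term.
        d = g_new + (1.0 - a) * (d - g_old)
        sum_sq += float(np.dot(g_new, g_new))

    return x
```

Calling `storm_sketch(noisy_quadratic_grad, np.array([3.0, -2.0]), steps=2000)` drives the iterate toward the minimizer at the origin on this toy problem. Setting `a = 1` at every step would discard the correction term and recover plain SGD, which is one way to see the estimator as a modification of momentum.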