LGJun 5, 2021

Escaping Saddle Points Faster with Stochastic Momentum

arXiv:2106.02985v125 citations
Originality Incremental advance
AI Analysis

This addresses a significant open question in machine learning by providing theoretical justification for the empirical success of momentum-based methods in deep learning, though it is incremental as it builds on existing SGD frameworks.

The paper tackles the problem of slow convergence in nonconvex stochastic optimization, particularly for deep neural network training, by showing that stochastic momentum helps SGD escape saddle points faster, leading to quicker convergence to second-order stationary points, with theoretical and experimental validation.

Stochastic gradient descent (SGD) with stochastic momentum is popular in nonconvex stochastic optimization and particularly for the training of deep neural networks. In standard SGD, parameters are updated by improving along the path of the gradient at the current iterate on a batch of examples, where the addition of a ``momentum'' term biases the update in the direction of the previous change in parameters. In non-stochastic convex optimization one can show that a momentum adjustment provably reduces convergence time in many settings, yet such results have been elusive in the stochastic and non-convex settings. At the same time, a widely-observed empirical phenomenon is that in training deep networks stochastic momentum appears to significantly improve convergence time, variants of it have flourished in the development of other popular update methods, e.g. ADAM [KB15], AMSGrad [RKK18], etc. Yet theoretical justification for the use of stochastic momentum has remained a significant open question. In this paper we propose an answer: stochastic momentum improves deep network training because it modifies SGD to escape saddle points faster and, consequently, to more quickly find a second order stationary point. Our theoretical results also shed light on the related question of how to choose the ideal momentum parameter--our analysis suggests that $β\in [0,1)$ should be large (close to 1), which comports with empirical findings. We also provide experimental findings that further validate these conclusions.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes