LG · OC · ML · Jan 26, 2019

Escaping Saddle Points with Adaptive Gradient Methods

arXiv:1901.09149v2 · 83 citations
AI Analysis

This paper offers theoretical insight for deep learning practitioners who use adaptive methods, though the contribution is incremental because it builds on existing optimization frameworks.

The paper studies adaptive gradient methods such as Adam and RMSProp in nonconvex optimization. It shows that they act as preconditioned SGD, which helps them escape saddle points faster than SGD, and it derives what the authors describe as the first second-order convergence result for such methods.

Adaptive methods such as Adam and RMSProp are widely used in deep learning but are not well understood. In this paper, we seek a crisp, clean and precise characterization of their behavior in nonconvex settings. To this end, we first provide a novel view of adaptive methods as preconditioned SGD, where the preconditioner is estimated in an online manner. By studying the preconditioner on its own, we elucidate its purpose: it rescales the stochastic gradient noise to be isotropic near stationary points, which helps escape saddle points. Furthermore, we show that adaptive methods can efficiently estimate the aforementioned preconditioner. By gluing together these two components, we provide the first (to our knowledge) second-order convergence result for any adaptive method. The key insight from our analysis is that, compared to SGD, adaptive methods escape saddle points faster, and can converge faster overall to second-order stationary points.
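
The preconditioning view described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm or analysis: it assumes a diagonal preconditioner diag(v)^(-1/2), where v estimates the per-coordinate second moment of the stochastic gradient, and it uses pure synthetic noise at a stationary point. The helper names (second_moment, precondition, rmsprop_step) and the noise scales are hypothetical, chosen only for illustration.

```python
import math
import random

def second_moment(samples):
    """Per-coordinate average of g_i^2 over a list of gradient samples."""
    n = len(samples)
    dim = len(samples[0])
    return [sum(g[i] ** 2 for g in samples) / n for i in range(dim)]

def precondition(g, v, eps=1e-8):
    """Apply the diagonal preconditioner diag(v)^(-1/2) to a gradient g."""
    return [gi / (math.sqrt(vi) + eps) for gi, vi in zip(g, v)]

def rmsprop_step(x, g, v, lr=0.01, beta=0.99, eps=1e-8):
    """One RMSProp-style step: refresh the online estimate v, then take a
    preconditioned SGD step using that estimate."""
    v = [beta * vi + (1.0 - beta) * gi * gi for vi, gi in zip(v, g)]
    step = precondition(g, v, eps)
    x = [xi - lr * si for xi, si in zip(x, step)]
    return x, v

# Assumed setup: at a stationary point the stochastic gradient is pure noise,
# here strongly anisotropic (standard deviation 10 vs 0.1 per coordinate).
rng = random.Random(0)
noise_scales = (10.0, 0.1)
samples = [[s * rng.gauss(0.0, 1.0) for s in noise_scales] for _ in range(5000)]

v_hat = second_moment(samples)  # idealized (offline) second-moment estimate
rescaled = second_moment([precondition(g, v_hat) for g in samples])

print("raw second moments:           ", v_hat)
print("preconditioned second moments:", rescaled)
```

In this toy setting the raw second moments differ by several orders of magnitude across coordinates, while the preconditioned ones are both close to 1, which is the sense in which the noise becomes isotropic. The rmsprop_step function shows the online variant, where v is an exponential moving average rather than an offline estimate.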

