Escaping Saddles with Stochastic Gradients
This addresses a fundamental bottleneck in training non-convex models like deep neural networks, offering a more efficient method for escaping saddle points without extra noise.
The paper tackles the problem of escaping saddle points in non-convex optimization by analyzing stochastic gradient variance along negative curvature directions, showing it can replace explicit noise injection, and derives a dimension-independent convergence rate for SGD to second-order stationary points.
We analyze the variance of stochastic gradients along negative curvature directions in certain non-convex machine learning models and show that stochastic gradients exhibit a strong component along these directions. Furthermore, we show that - contrary to the case of isotropic noise - this variance is proportional to the magnitude of the corresponding eigenvalues and not decreasing in the dimensionality. Based upon this observation we propose a new assumption under which we show that the injection of explicit, isotropic noise usually applied to make gradient descent escape saddle points can successfully be replaced by a simple SGD step. Additionally - and under the same condition - we derive the first convergence rate for plain SGD to a second-order stationary point in a number of iterations that is independent of the problem dimension.