LG, OC · Mar 5, 2021

Second-order step-size tuning of SGD for non-convex optimization

arXiv:2103.03570v2 · 11 citations
AI Analysis

This work addresses optimization challenges in deep learning, offering an incremental improvement over existing first-order methods for non-convex problems.

The paper improves stochastic gradient descent (SGD) for non-convex optimization by tuning its step-sizes with second-order curvature estimates computed from noisy mini-batch gradients. When training deep residual networks, the resulting method, Step-Tuned SGD, shows a sudden drop in loss and an improvement in test accuracy at medium stages of training, and the authors report better results than SGD, RMSprop, and ADAM.

In view of a direct and simple improvement of vanilla SGD, this paper presents a fine-tuning of its step-sizes in the mini-batch case. For doing so, one estimates curvature, based on a local quadratic model and using only noisy gradient approximations. One obtains a new stochastic first-order method (Step-Tuned SGD), enhanced by second-order information, which can be seen as a stochastic version of the classical Barzilai-Borwein method. Our theoretical results ensure almost sure convergence to the critical set and we provide convergence rates. Experiments on deep residual network training illustrate the favorable properties of our approach. For such networks we observe, during training, both a sudden drop of the loss and an improvement of test accuracy at medium stages, yielding better results than SGD, RMSprop, or ADAM.
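
The abstract describes Step-Tuned SGD as a stochastic version of the classical Barzilai-Borwein method. To illustrate that underlying idea only, here is a minimal sketch of the classical, deterministic Barzilai-Borwein step-size on a small quadratic. It is not the paper's algorithm, which works with noisy mini-batch gradients. The function name `barzilai_borwein`, the parameters `init_step` and `eps`, and the test problem are assumptions chosen for illustration.

```python
import numpy as np

def barzilai_borwein(grad, x0, n_iters=50, init_step=1e-3, eps=1e-30):
    """Gradient descent with the classical Barzilai-Borwein (BB) step-size.

    Illustrative sketch only: deterministic gradients, not the
    mini-batch setting of Step-Tuned SGD.
    """
    x_prev = np.asarray(x0, dtype=float)
    g_prev = grad(x_prev)
    step = init_step
    x = x_prev - step * g_prev           # first step uses the initial step-size
    for _ in range(n_iters):
        g = grad(x)
        s = x - x_prev                   # change in iterates
        y = g - g_prev                   # change in gradients
        curvature = s @ y                # equals s^T H s for a quadratic with Hessian H
        if curvature > eps:
            step = (s @ s) / curvature   # BB step: inverse of the estimated curvature
        # otherwise keep the previous step (non-positive or negligible curvature)
        x_prev, g_prev = x, g
        x = x - step * g
    return x

# Hypothetical test problem: f(x) = 0.5 * x^T A x - b^T x, minimized where A x = b.
A = np.diag([1.0, 10.0])
b = np.array([1.0, -2.0])
x_star = barzilai_borwein(lambda x: A @ x - b, x0=np.zeros(2))
```

The ratio `(s @ s) / (s @ y)` acts as a scalar stand-in for the inverse Hessian along the last displacement, so curvature information is obtained from gradient differences alone. In a non-convex or noisy setting `s @ y` can be negative or unreliable, which is why the sketch guards the update; how the paper handles that case is not stated in the abstract.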

Code Implementations: 1 repo
