LGOCJun 5, 2021

Bandwidth-based Step-Sizes for Non-Convex Stochastic Optimization

arXiv:2106.02888v22 citations
Originality Incremental advance
AI Analysis

This work addresses the need for theoretical foundations for popular but unproven learning-rate schedules in deep learning, offering incremental improvements in understanding and optimization.

The paper tackles the problem of providing theoretical convergence guarantees for a class of learning-rate schedules, including cyclic and non-monotonic step-sizes, in non-convex stochastic optimization, showing optimality for step-decay and verifying efficiency on deep neural network tasks.

Many popular learning-rate schedules for deep neural networks combine a decaying trend with local perturbations that attempt to escape saddle points and bad local minima. We derive convergence guarantees for bandwidth-based step-sizes, a general class of learning rates that are allowed to vary in a banded region. This framework includes many popular cyclic and non-monotonic step-sizes for which no theoretical guarantees were previously known. We provide worst-case guarantees for SGD on smooth non-convex problems under several bandwidth-based step sizes, including stagewise $1/\sqrt{t}$ and the popular step-decay (constant and then drop by a constant), which is also shown to be optimal. Moreover, we show that its momentum variant converges as fast as SGD with the bandwidth-based step-decay step-size. Finally, we propose novel step-size schemes in the bandwidth-based family and verify their efficiency on several deep neural network training tasks.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes