LG · Feb 22, 2021

Provable Super-Convergence with a Large Cyclical Learning Rate

arXiv:2102.10734v2 · 16 citations
Originality: Highly original
AI Analysis

This work addresses optimization efficiency for machine learning practitioners, offering a provable method for accelerating training on problems with bimodal Hessian spectra, though it is incremental in that it builds on existing cyclical learning rate concepts.

The paper tackles the problem of slow convergence in gradient-based optimization by introducing a cyclical learning rate scheme with an unstably large step, achieving a convergence rate that depends only logarithmically on the condition number. This result provides a theoretical explanation for empirical observations of 'super-convergence' in prior work.

Conventional wisdom dictates that the learning rate should be in the stable regime so that gradient-based algorithms don't blow up. This letter introduces a simple scenario where an unstably large learning rate scheme leads to super-fast convergence, with the convergence rate depending only logarithmically on the condition number of the problem. Our scheme uses a Cyclical Learning Rate (CLR) where we periodically take one large unstable step and several small stable steps to compensate for the instability. These findings also help explain the empirical observations of [Smith and Topin, 2019], who show that CLR with a large maximum learning rate can dramatically accelerate learning and lead to so-called "super-convergence". We prove that our scheme excels on problems where the Hessian exhibits a bimodal spectrum and the eigenvalues can be grouped into two clusters (small and large). The unstably large step is the key to enabling fast convergence over the small eigen-spectrum.
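
The idea of one unstable step followed by several stabilizing steps can be illustrated on a toy diagonal quadratic. The sketch below is not the paper's exact schedule: the eigenvalue clusters, the two step sizes (`lr_big`, `lr_small`), and the number of small steps (`n_small`) are all illustrative assumptions, chosen so that the large step contracts the small cluster by about 1/3 while amplifying the large cluster roughly 1300-fold, and eight small steps (3^8 = 6561) more than undo that amplification.

```python
import numpy as np

# Hypothetical bimodal spectrum: eigenvalues of the diagonal quadratic
# f(x) = 0.5 * sum(eigs * x**2), grouped into a small and a large cluster.
small = np.linspace(1e-3, 2e-3, 5)
large = np.linspace(1.0, 2.0, 5)
eigs = np.concatenate([small, large])


def run_gd(schedule, n_steps):
    """Gradient descent from x = 1 in every eigen-direction; returns the final loss."""
    x = np.ones_like(eigs)
    for t in range(n_steps):
        x = x - schedule(t) * eigs * x  # the gradient of f is eigs * x
    return 0.5 * np.sum(eigs * x**2)


def constant(t):
    # Baseline: 1 / lambda_max, a step size that is stable for every eigenvalue.
    return 1.0 / large.max()


# Assumed step sizes: each one is tuned to the midpoint of one cluster,
# so it contracts that cluster by a factor of at most 1/3 per step.
lr_big = 2.0 / (small.min() + small.max())    # unstable for the large cluster
lr_small = 2.0 / (large.min() + large.max())  # stable for all eigenvalues
n_small = 8                                   # small steps per cycle (assumption)
period = 1 + n_small


def cyclical(t):
    # One large unstable step, then n_small small stable steps, repeated.
    return lr_big if t % period == 0 else lr_small


n_steps = 20 * period
loss_const = run_gd(constant, n_steps)
loss_clr = run_gd(cyclical, n_steps)
print(f"constant step: {loss_const:.3e}")
print(f"cyclical step: {loss_clr:.3e}")
```

In this toy setting the constant stable step barely moves the small-eigenvalue directions, whereas the cyclical schedule shrinks every direction by a constant factor per cycle. The number of compensating small steps grows only with the logarithm of the amplification factor, which is in the spirit of the logarithmic dependence on the condition number described above.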
