LGMLFeb 9, 2022

Optimal learning rate schedules in high-dimensional non-convex optimization problems

arXiv:2202.04509v110 citations
Originality Incremental advance
AI Analysis

This work addresses the challenge of efficient optimization in realistic machine learning settings with non-convex landscapes, providing theoretical insights for practitioners, though it is incremental as it builds on existing scheduling methods.

The paper tackles the problem of optimizing learning rate schedules for high-dimensional non-convex optimization, finding that a decay rate β<1 is optimal to avoid saddles, and a two-phase schedule with large rates for exploration and convex rates for convergence yields the best results, as validated in neural network regression tasks.

Learning rate schedules are ubiquitously used to speed up and improve optimisation. Many different policies have been introduced on an empirical basis, and theoretical analyses have been developed for convex settings. However, in many realistic problems the loss-landscape is high-dimensional and non convex -- a case for which results are scarce. In this paper we present a first analytical study of the role of learning rate scheduling in this setting, focusing on Langevin optimization with a learning rate decaying as $η(t)=t^{-β}$. We begin by considering models where the loss is a Gaussian random function on the $N$-dimensional sphere ($N\rightarrow \infty$), featuring an extensive number of critical points. We find that to speed up optimization without getting stuck in saddles, one must choose a decay rate $β<1$, contrary to convex setups where $β=1$ is generally optimal. We then add to the problem a signal to be recovered. In this setting, the dynamics decompose into two phases: an \emph{exploration} phase where the dynamics navigates through rough parts of the landscape, followed by a \emph{convergence} phase where the signal is detected and the dynamics enter a convex basin. In this case, it is optimal to keep a large learning rate during the exploration phase to escape the non-convex region as quickly as possible, then use the convex criterion $β=1$ to converge rapidly to the solution. Finally, we demonstrate that our conclusions hold in a common regression task involving neural networks.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes