LG, AI, OC | Apr 1, 2022

Learning to Accelerate by the Methods of Step-size Planning

arXiv:2204.01705v4
h-index: 18
Originality: Highly original
AI Analysis

This work addresses the acceleration of optimization algorithms, a problem of interest to machine learning researchers and practitioners. Its approach is incremental in concept but reports strong empirical gains.

The paper tackles the slow convergence of gradient descent on ill-conditioned and non-convex problems by introducing step-size planning methods, which use past update experience to learn improved step-size models. For a convex problem, the reported convergence rate surpasses that of Nesterov's accelerated gradient, 1 - √(μ/L), which is the theoretical limit for first-order methods. On the non-convex Rosenbrock function, the method reaches zero error in under 500 gradient evaluations, whereas gradient descent needs about 10,000 gradient evaluations to reach 10^{-3} accuracy.

Gradient descent is slow to converge for ill-conditioned problems and non-convex problems. An important technique for acceleration is step-size adaptation. The first part of this paper contains a detailed review of step-size adaptation methods, including Polyak step-size, L4, LossGrad, Adam, IDBD, and Hypergradient descent, and the relation of step-size adaptation to meta-gradient methods. In the second part of this paper, we propose a new class of methods for accelerating gradient descent that are distinct from existing techniques. The new methods, which we call step-size planning, use the update experience to learn an improved way of updating the parameters. The methods organize the experience into updates that are $K$ steps apart from each other to facilitate planning. From the past experience, our planning algorithm, Csawg, learns a step-size model, which is a form of multi-step machine that predicts future updates. We extend Csawg to apply step-size planning over multiple steps, which leads to further speedup. We discuss and highlight the projection power of the diagonal-matrix step-size for future large-scale applications. We show that, for a convex problem, our methods can surpass the convergence rate of Nesterov's accelerated gradient, $1 - \sqrt{\mu/L}$, where $\mu$ is the strong convexity constant of the loss function $F$ and $L$ is the Lipschitz constant of $F'$; this rate is the theoretical limit for the convergence rate of first-order methods. On the well-known non-convex Rosenbrock function, our planning methods achieve zero error in under 500 gradient evaluations, while gradient descent takes about 10000 gradient evaluations to reach a $10^{-3}$ accuracy. We discuss the connection of step-size planning to planning in reinforcement learning, in particular, Dyna architectures. (This is a shorter abstract than in the paper because of the length requirement.)
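To give a feel for the rate $1 - \sqrt{\mu/L}$ quoted above, the following minimal sketch compares it with the textbook linear rate $1 - \mu/L$ of plain gradient descent with step size $1/L$. The gradient-descent rate, the values $\mu = 1$ and $L = 100$, the tolerance, and all function names are assumptions chosen for illustration; none of them come from the paper, and the sketch does not implement step-size planning or Csawg.

```python
import math


def gd_rate(mu: float, L: float) -> float:
    """Textbook per-step contraction factor of gradient descent with step 1/L
    on a mu-strongly convex, L-smooth function (assumed, not from the paper)."""
    return 1.0 - mu / L


def nesterov_rate(mu: float, L: float) -> float:
    """Contraction factor quoted in the abstract for Nesterov's accelerated gradient."""
    return 1.0 - math.sqrt(mu / L)


def iterations_to_tolerance(rate: float, tol: float) -> int:
    """Smallest integer k with rate**k <= tol, for a linear rate in (0, 1)."""
    return math.ceil(math.log(tol) / math.log(rate))


# Illustrative values only: condition number L / mu = 100, error reduced by 1e-3.
mu, L, tol = 1.0, 100.0, 1e-3
gd_iters = iterations_to_tolerance(gd_rate(mu, L), tol)
nag_iters = iterations_to_tolerance(nesterov_rate(mu, L), tol)
print(f"gradient descent: rate {gd_rate(mu, L):.2f}, about {gd_iters} iterations")
print(f"Nesterov:         rate {nesterov_rate(mu, L):.2f}, about {nag_iters} iterations")
```

Under these assumed values the gap between the two rates grows with the condition number $L/\mu$, which is why a method that beats $1 - \sqrt{\mu/L}$ matters most on ill-conditioned problems.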

