LGOCMLAug 11, 2023

Adaptive SGD with Polyak stepsize and Line-search: Robust Convergence and Variance Reduction

arXiv:2308.06058v236 citationsh-index: 40
Originality Highly original
AI Analysis

This work addresses convergence problems in adaptive SGD methods for machine learning practitioners, offering improved algorithms for non-interpolation settings.

The authors tackled the issue of stochastic Polyak stepsize and stochastic line-search converging only to a neighborhood of a solution in non-interpolation settings, proposing AdaSPS and AdaSLS variants that guarantee convergence and maintain sub-linear and linear rates for convex and strongly convex functions, with AdaSPS requiring $\smash{\widetilde{\mathcal{O}}}(n+1/ε)$ gradient evaluations for $\mathcal{O}(ε)$-suboptimality after adding variance reduction.

The recently proposed stochastic Polyak stepsize (SPS) and stochastic line-search (SLS) for SGD have shown remarkable effectiveness when training over-parameterized models. However, in non-interpolation settings, both algorithms only guarantee convergence to a neighborhood of a solution which may result in a worse output than the initial guess. While artificially decreasing the adaptive stepsize has been proposed to address this issue (Orvieto et al. [2022]), this approach results in slower convergence rates for convex and over-parameterized models. In this work, we make two contributions: Firstly, we propose two new variants of SPS and SLS, called AdaSPS and AdaSLS, which guarantee convergence in non-interpolation settings and maintain sub-linear and linear convergence rates for convex and strongly convex functions when training over-parameterized models. AdaSLS requires no knowledge of problem-dependent parameters, and AdaSPS requires only a lower bound of the optimal function value as input. Secondly, we equip AdaSPS and AdaSLS with a novel variance reduction technique and obtain algorithms that require $\smash{\widetilde{\mathcal{O}}}(n+1/ε)$ gradient evaluations to achieve an $\mathcal{O}(ε)$-suboptimality for convex functions, which improves upon the slower $\mathcal{O}(1/ε^2)$ rates of AdaSPS and AdaSLS without variance reduction in the non-interpolation regimes. Moreover, our result matches the fast rates of AdaSVRG but removes the inner-outer-loop structure, which is easier to implement and analyze. Finally, numerical experiments on synthetic and real datasets validate our theory and demonstrate the effectiveness and robustness of our algorithms.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes