LGOCMLMar 12, 2025

Benefits of Learning Rate Annealing for Tuning-Robustness in Stochastic Optimization

arXiv:2503.09411v11 citationsh-index: 31
Originality Incremental advance
AI Analysis

This reduces computational overhead in hyperparameter search for training large-scale models, but it is incremental as it builds on existing annealing methods with a theoretical analysis.

The paper tackles the problem of costly hyperparameter tuning for learning rates in stochastic gradient methods by showing that learning rate annealing schemes, like cosine schedules, are more robust to initial misspecification due to coarse grid searches. It demonstrates that annealed schedules achieve a convergence rate of O(ρ^{1/(2p+1)}/√T), which depends sublinearly on the misspecification factor ρ, compared to O(ρ/√T) for fixed stepsizes, with experiments confirming increased robustness.

The learning rate in stochastic gradient methods is a critical hyperparameter that is notoriously costly to tune via standard grid search, especially for training modern large-scale models with billions of parameters. We identify a theoretical advantage of learning rate annealing schemes that decay the learning rate to zero at a polynomial rate, such as the widely-used cosine schedule, by demonstrating their increased robustness to initial parameter misspecification due to a coarse grid search. We present an analysis in a stochastic convex optimization setup demonstrating that the convergence rate of stochastic gradient descent with annealed schedules depends sublinearly on the multiplicative misspecification factor $ρ$ (i.e., the grid resolution), achieving a rate of $O(ρ^{1/(2p+1)}/\sqrt{T})$ where $p$ is the degree of polynomial decay and $T$ is the number of steps, in contrast to the $O(ρ/\sqrt{T})$ rate that arises with fixed stepsizes and exhibits a linear dependence on $ρ$. Experiments confirm the increased robustness compared to tuning with a fixed stepsize, that has significant implications for the computational overhead of hyperparameter search in practical training scenarios.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes