Benefits of Learning Rate Annealing for Tuning-Robustness in Stochastic Optimization
This reduces computational overhead in hyperparameter search for training large-scale models, but it is incremental as it builds on existing annealing methods with a theoretical analysis.
The paper tackles the problem of costly hyperparameter tuning for learning rates in stochastic gradient methods by showing that learning rate annealing schemes, like cosine schedules, are more robust to initial misspecification due to coarse grid searches. It demonstrates that annealed schedules achieve a convergence rate of O(ρ^{1/(2p+1)}/√T), which depends sublinearly on the misspecification factor ρ, compared to O(ρ/√T) for fixed stepsizes, with experiments confirming increased robustness.
The learning rate in stochastic gradient methods is a critical hyperparameter that is notoriously costly to tune via standard grid search, especially for training modern large-scale models with billions of parameters. We identify a theoretical advantage of learning rate annealing schemes that decay the learning rate to zero at a polynomial rate, such as the widely-used cosine schedule, by demonstrating their increased robustness to initial parameter misspecification due to a coarse grid search. We present an analysis in a stochastic convex optimization setup demonstrating that the convergence rate of stochastic gradient descent with annealed schedules depends sublinearly on the multiplicative misspecification factor $ρ$ (i.e., the grid resolution), achieving a rate of $O(ρ^{1/(2p+1)}/\sqrt{T})$ where $p$ is the degree of polynomial decay and $T$ is the number of steps, in contrast to the $O(ρ/\sqrt{T})$ rate that arises with fixed stepsizes and exhibits a linear dependence on $ρ$. Experiments confirm the increased robustness compared to tuning with a fixed stepsize, that has significant implications for the computational overhead of hyperparameter search in practical training scenarios.