Online Learning-guided Learning Rate Adaptation via Gradient Alignment
This work addresses the time-consuming hyperparameter tuning that optimizers require in large-scale deep learning, offering an incremental improvement through adaptive learning rate schedules.
The paper tackles the problem of tuning learning rates in deep learning by proposing GALA, a framework that dynamically adjusts the learning rate based on the alignment between consecutive gradients and a local curvature estimate. In the authors' experiments, optimizers augmented with GALA perform robustly across a wide range of initial learning rates and remain competitive without manual tuning.
The performance of an optimizer on large-scale deep learning models depends critically on fine-tuning the learning rate, often requiring an extensive grid search over base learning rates, schedules, and other hyperparameters. In this paper, we propose a principled framework called GALA (Gradient Alignment-based Learning rate Adaptation), which dynamically adjusts the learning rate by tracking the alignment between consecutive gradients and using a local curvature estimate. Guided by the convergence analysis, we formulate the problem of selecting the learning rate as a one-dimensional online learning problem. When paired with an online learning algorithm such as Follow-the-Regularized-Leader, our method produces a flexible, adaptive learning rate schedule that tends to increase when consecutive gradients are aligned and decrease otherwise. We establish a data-adaptive convergence rate for normalized SGD equipped with GALA in the smooth, nonconvex setting. Empirically, common optimizers such as SGD and Adam, when augmented with GALA, demonstrate robust performance across a wide range of initial learning rates and perform competitively without the need for tuning.
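The qualitative behavior described above (the learning rate grows when consecutive gradients are aligned and shrinks otherwise) can be illustrated with a small sketch. This is not the GALA update from the paper: the curvature estimate and the Follow-the-Regularized-Leader step are omitted, and a simple multiplicative rule driven by the cosine similarity of consecutive gradients is swapped in. The names `cosine`, `adapt_lr`, `normalized_sgd`, and the parameters `meta_lr`, `lr_min`, `lr_max` are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; returns 0.0 if either is zero."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def adapt_lr(lr, grad, prev_grad, meta_lr=0.1, lr_min=1e-6, lr_max=10.0):
    """Illustrative rule (an assumption, not the paper's update):
    scale lr by exp(meta_lr * alignment), then clip to [lr_min, lr_max]."""
    align = cosine(grad, prev_grad)
    new_lr = lr * math.exp(meta_lr * align)
    return min(max(new_lr, lr_min), lr_max)


def normalized_sgd(grad_fn, x0, lr0, steps):
    """Normalized gradient descent with the alignment-based rule above.
    Returns the final iterate and the learning rate used at each step."""
    x = list(x0)
    lr = lr0
    prev = None
    history = []
    for _ in range(steps):
        g = grad_fn(x)
        norm = math.sqrt(sum(v * v for v in g))
        if norm == 0.0:
            break  # stationary point reached
        if prev is not None:
            lr = adapt_lr(lr, g, prev)
        # Normalized step: move a distance lr along the negative gradient.
        x = [xi - lr * gi / norm for xi, gi in zip(x, g)]
        prev = g
        history.append(lr)
    return x, history


def quad_grad(x):
    """Gradient of the toy objective f(x) = 0.5 * (x[0]**2 + 10 * x[1]**2)."""
    return [x[0], 10.0 * x[1]]


x_final, lrs = normalized_sgd(quad_grad, [5.0, 5.0], lr0=1e-3, steps=50)
```

Starting from a deliberately small `lr0`, consecutive gradients on this toy quadratic point in nearly the same direction, so the learning rate in `lrs` grows over the run. A badly chosen initial rate is therefore corrected automatically, which is the robustness property the abstract claims for GALA, shown here only for this simplified stand-in rule.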