LGOCJul 31, 2024

Convergence rates of stochastic gradient method with independent sequences of step-size and momentum weight

arXiv:2408.02678v1
Originality Synthesis-oriented
AI Analysis

This work provides theoretical justification for hyper-parameter tuning in stochastic optimization, which is incremental but relevant for practitioners in machine learning.

The paper analyzes the convergence rates of stochastic gradient method with momentum under two step-size schedules (diminishing-to-zero and constant-and-drop) for strongly convex functions, showing that the rate is exponential in step-size and polynomial in momentum weight, and justifies default settings in large-scale software.

In large-scale learning algorithms, the momentum term is usually included in the stochastic sub-gradient method to improve the learning speed because it can navigate ravines efficiently to reach a local minimum. However, step-size and momentum weight hyper-parameters must be appropriately tuned to optimize convergence. We thus analyze the convergence rate using stochastic programming with Polyak's acceleration of two commonly used step-size learning rates: ``diminishing-to-zero" and ``constant-and-drop" (where the sequence is divided into stages and a constant step-size is applied at each stage) under strongly convex functions over a compact convex set with bounded sub-gradients. For the former, we show that the convergence rate can be written as a product of exponential in step-size and polynomial in momentum weight. Our analysis justifies the convergence of using the default momentum weight setting and the diminishing-to-zero step-size sequence in large-scale machine learning software. For the latter, we present the condition for the momentum weight sequence to converge at each stage.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes