LG OCJul 31, 2024

Convergence rates of stochastic gradient method with independent sequences of step-size and momentum weight

arXiv:2408.02678v1

Originality Synthesis-oriented

AI Analysis

This work provides theoretical justification for hyper-parameter tuning in stochastic optimization, which is incremental but relevant for practitioners in machine learning.

The paper analyzes the convergence rates of stochastic gradient method with momentum under two step-size schedules (diminishing-to-zero and constant-and-drop) for strongly convex functions, showing that the rate is exponential in step-size and polynomial in momentum weight, and justifies default settings in large-scale software.

In large-scale learning algorithms, the momentum term is usually included in the stochastic sub-gradient method to improve the learning speed because it can navigate ravines efficiently to reach a local minimum. However, step-size and momentum weight hyper-parameters must be appropriately tuned to optimize convergence. We thus analyze the convergence rate using stochastic programming with Polyak's acceleration of two commonly used step-size learning rates: ``diminishing-to-zero" and ``constant-and-drop" (where the sequence is divided into stages and a constant step-size is applied at each stage) under strongly convex functions over a compact convex set with bounded sub-gradients. For the former, we show that the convergence rate can be written as a product of exponential in step-size and polynomial in momentum weight. Our analysis justifies the convergence of using the default momentum weight setting and the diminishing-to-zero step-size sequence in large-scale machine learning software. For the latter, we present the condition for the momentum weight sequence to converge at each stage.

View on arXiv PDF

Similar