OCLGNASep 8, 2022

Losing momentum in continuous-time stochastic optimisation

arXiv:2209.03705v23 citationsh-index: 11
AI Analysis

This work addresses the theoretical understanding and practical implementation of momentum in stochastic optimization for machine learning, but it is incremental as it builds on existing momentum methods with a continuous-time analysis.

The authors tackled the problem of analyzing momentum-based stochastic optimization in continuous time, showing convergence to the global minimizer under convexity when reducing momentum over time and increasing subsampling rate, and proposed a discretization scheme that achieved competitive results in experiments including neural network training.

The training of modern machine learning models often consists in solving high-dimensional non-convex optimisation problems that are subject to large-scale data. In this context, momentum-based stochastic optimisation algorithms have become particularly widespread. The stochasticity arises from data subsampling which reduces computational cost. Both, momentum and stochasticity help the algorithm to converge globally. In this work, we propose and analyse a continuous-time model for stochastic gradient descent with momentum. This model is a piecewise-deterministic Markov process that represents the optimiser by an underdamped dynamical system and the data subsampling through a stochastic switching. We investigate longtime limits, the subsampling-to-no-subsampling limit, and the momentum-to-no-momentum limit. We are particularly interested in the case of reducing the momentum over time. Under convexity assumptions, we show convergence of our dynamical system to the global minimiser when reducing momentum over time and letting the subsampling rate go to infinity. We then propose a stable, symplectic discretisation scheme to construct an algorithm from our continuous-time dynamical system. In experiments, we study our scheme in convex and non-convex test problems. Additionally, we train a convolutional neural network in an image classification problem. Our algorithm {attains} competitive results compared to stochastic gradient descent with momentum.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes