MLLGOCPRFeb 2, 2025

Algorithmic Stability of Stochastic Gradient Descent with Momentum under Heavy-Tailed Noise

arXiv:2502.00885v11 citationsh-index: 22
Originality Incremental advance
AI Analysis

This work addresses a gap in theoretical understanding for heavy-tailed optimizers beyond SGD, providing insights for machine learning practitioners dealing with noisy data.

The paper tackles the generalization properties of stochastic gradient descent with momentum (SGDm) under heavy-tailed gradient noise, establishing that for quadratic loss functions, SGDm has worse generalization bounds in such settings, indicating harmful interactions between momentum and heavy tails.

Understanding the generalization properties of optimization algorithms under heavy-tailed noise has gained growing attention. However, the existing theoretical results mainly focus on stochastic gradient descent (SGD) and the analysis of heavy-tailed optimizers beyond SGD is still missing. In this work, we establish generalization bounds for SGD with momentum (SGDm) under heavy-tailed gradient noise. We first consider the continuous-time limit of SGDm, i.e., a Levy-driven stochastic differential equation (SDE), and establish quantitative Wasserstein algorithmic stability bounds for a class of potentially non-convex loss functions. Our bounds reveal a remarkable observation: For quadratic loss functions, we show that SGDm admits a worse generalization bound in the presence of heavy-tailed noise, indicating that the interaction of momentum and heavy tails can be harmful for generalization. We then extend our analysis to discrete-time and develop a uniform-in-time discretization error bound, which, to our knowledge, is the first result of its kind for SDEs with degenerate noise. This result shows that, with appropriately chosen step-sizes, the discrete dynamics retain the generalization properties of the limiting SDE. We illustrate our theory on both synthetic quadratic problems and neural networks.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes