LG PR Oct 23, 2025

Global Dynamics of Heavy-Tailed SGDs in Nonconvex Loss Landscape: Characterization and Control

arXiv:2510.20905v1
2 citations, h-index: 3
Originality: Incremental advance
AI Analysis

This work addresses a gap in the theoretical understanding of SGD's global behavior that matters to machine learning practitioners, and it offers a method for improving generalization in nonconvex optimization, though it builds incrementally on existing large deviations analysis.

The paper studies how SGD avoids sharp local minima and how that ability can be strengthened to improve generalization. It characterizes the global dynamics of heavy-tailed SGDs and shows that injecting and then truncating heavy-tailed noise lets SGD almost completely avoid sharp minima, with improved generalization performance in simulations and deep learning experiments.

Stochastic gradient descent (SGD) and its variants enable modern artificial intelligence. However, theoretical understanding lags far behind their empirical success. It is widely believed that SGD has a curious ability to avoid sharp local minima in the loss landscape, which are associated with poor generalization. To unravel this mystery and further enhance this capability of SGDs, it is imperative to go beyond the traditional local convergence analysis and obtain a comprehensive understanding of SGDs' global dynamics. In this paper, we develop a set of technical machinery based on the recent large deviations and metastability analysis in Wang and Rhee (2023) and obtain a sharp characterization of the global dynamics of heavy-tailed SGDs. In particular, we reveal a fascinating phenomenon in deep learning: by injecting and then truncating heavy-tailed noise during the training phase, SGD can almost completely avoid sharp minima and achieve better generalization performance on the test data. Simulation and deep learning experiments confirm our theoretical prediction that heavy-tailed SGD with gradient clipping finds local minima with flatter geometry and achieves better generalization performance.
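To make the "inject, then truncate" idea concrete, here is a minimal one-dimensional sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the noise law or the exact update rule, so the symmetric Pareto-type noise, the clipping of the whole step at a threshold `b`, and every name below (`heavy_tailed_noise`, `truncated_heavy_tailed_sgd`, `double_well_grad`) are hypothetical choices.

```python
import numpy as np


def heavy_tailed_noise(rng, alpha):
    """Symmetric noise with a power-law tail of index alpha (assumed noise law)."""
    magnitude = rng.pareto(alpha)        # Pareto II (Lomax) sample, heavy right tail
    sign = rng.choice([-1.0, 1.0])       # symmetrize around zero
    return sign * magnitude


def truncated_heavy_tailed_sgd(grad, x0, lr, b, alpha, steps, seed=0):
    """Run 1-D SGD with injected heavy-tailed noise and a step clipped to [-b, b].

    Assumed update (not taken from the paper's text):
        x <- x - clip(lr * (grad(x) + noise), -b, b)
    """
    rng = np.random.default_rng(seed)
    x = float(x0)
    path = np.empty(steps + 1)
    path[0] = x
    for t in range(steps):
        noisy_grad = grad(x) + heavy_tailed_noise(rng, alpha)
        step = float(np.clip(lr * noisy_grad, -b, b))  # truncation bounds each move
        x = x - step
        path[t + 1] = x
    return path


def double_well_grad(x):
    """Gradient of the toy loss f(x) = x**4 / 4 - x**2 / 2 (hypothetical example)."""
    return x ** 3 - x


path = truncated_heavy_tailed_sgd(
    double_well_grad, x0=1.0, lr=0.01, b=0.5, alpha=1.5, steps=1000
)
```

Because every step is clipped to magnitude at most `b`, a single large noise draw cannot carry the iterate arbitrarily far. Leaving a wide basin then takes several large jumps in a row, while a narrow basin can be left in one. This is the intuition behind the claim that truncation steers SGD away from sharp minima. The toy loss above is only a placeholder and is not meant to reproduce the paper's experiments.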
