LGOct 23, 2025
Global Dynamics of Heavy-Tailed SGDs in Nonconvex Loss Landscape: Characterization and ControlXingyu Wang, Chang-Han Rhee
Stochastic gradient descent (SGD) and its variants enable modern artificial intelligence. However, theoretical understanding lags far behind their empirical success. It is widely believed that SGD has a curious ability to avoid sharp local minima in the loss landscape, which are associated with poor generalization. To unravel this mystery and further enhance such capability of SGDs, it is imperative to go beyond the traditional local convergence analysis and obtain a comprehensive understanding of SGDs' global dynamics. In this paper, we develop a set of technical machinery based on the recent large deviations and metastability analysis in Wang and Rhee (2023) and obtain sharp characterization of the global dynamics of heavy-tailed SGDs. In particular, we reveal a fascinating phenomenon in deep learning: by injecting and then truncating heavy-tailed noises during the training phase, SGD can almost completely avoid sharp minima and achieve better generalization performance for the test data. Simulation and deep learning experiments confirm our theoretical prediction that heavy-tailed SGD with gradient clipping finds local minima with a more flat geometry and achieves better generalization performance.
LGFeb 8, 2021
Eliminating Sharp Minima from SGD with Truncated Heavy-tailed NoiseXingyu Wang, Sewoong Oh, Chang-Han Rhee
The empirical success of deep learning is often attributed to SGD's mysterious ability to avoid sharp local minima in the loss landscape, as sharp minima are known to lead to poor generalization. Recently, empirical evidence of heavy-tailed gradient noise was reported in many deep learning tasks, and it was shown in Şimşekli (2019a,b) that SGD can escape sharp local minima under the presence of such heavy-tailed gradient noise, providing a partial solution to the mystery. In this work, we analyze a popular variant of SGD where gradients are truncated above a fixed threshold. We show that it achieves a stronger notion of avoiding sharp minima: it can effectively eliminate sharp local minima entirely from its training trajectory. We characterize the dynamics of truncated SGD driven by heavy-tailed noises. First, we show that the truncation threshold and width of the attraction field dictate the order of the first exit time from the associated local minimum. Moreover, when the objective function satisfies appropriate structural conditions, we prove that as the learning rate decreases, the dynamics of heavy-tailed truncated SGD closely resemble those of a continuous-time Markov chain that never visits any sharp minima. Real data experiments on deep learning confirm our theoretical prediction that heavy-tailed SGD with gradient clipping finds a "flatter" local minima and achieves better generalization.