Arseniy Andreyev

h-index4
2papers

2 Papers

41.8LGApr 15
Momentum Further Constrains Sharpness at the Edge of Stochastic Stability

Arseniy Andreyev, Advikar Ananthkumar, Marc Walden et al.

Recent work suggests that (stochastic) gradient descent self-organizes near an instability boundary, shaping both optimization and the solutions found. Momentum and mini-batch gradients are widely used in practical deep learning optimization, but it remains unclear whether they operate in a comparable regime of instability. We demonstrate that SGD with momentum exhibits an Edge of Stochastic Stability (EoSS)-like regime with batch-size-dependent behavior that cannot be explained by a single momentum-adjusted stability threshold. Batch Sharpness (the expected directional mini-batch curvature) stabilizes in two distinct regimes: at small batch sizes it converges to a lower plateau $2(1-β)/η$, reflecting amplification of stochastic fluctuations by momentum and favoring flatter regions than vanilla SGD; at large batch sizes it converges to a higher plateau $2(1+β)/η$, where momentum recovers its classical stabilizing effect and favors sharper regions consistent with full-batch dynamics. We further show that this aligns with linear stability thresholds and discuss the implications for hyperparameter tuning and coupling.

LGDec 29, 2024
Edge of Stochastic Stability: Revisiting the Edge of Stability for SGD

Arseniy Andreyev, Pierfrancesco Beneventano

Recent findings by Cohen et al., 2021, demonstrate that when training neural networks with full-batch gradient descent with a step size of $η$, the largest eigenvalue $λ_{\max}$ of the full-batch Hessian consistently stabilizes at $λ_{\max} = 2/η$. These results have significant implications for convergence and generalization. This, however, is not the case of mini-batch stochastic gradient descent (SGD), limiting the broader applicability of its consequences. We show that SGD trains in a different regime we term Edge of Stochastic Stability (EoSS). In this regime, what stabilizes at $2/η$ is *Batch Sharpness*: the expected directional curvature of mini-batch Hessians along their corresponding stochastic gradients. As a consequence $λ_{\max}$ -- which is generally smaller than Batch Sharpness -- is suppressed, aligning with the long-standing empirical observation that smaller batches and larger step sizes favor flatter minima. We further discuss implications for mathematical modeling of SGD trajectories.