Edge of Stochastic Stability: Revisiting the Edge of Stability for SGD
This work addresses a theoretical gap for researchers in optimization and deep learning by extending stability analysis to stochastic settings, though it appears incremental as it builds directly on prior findings.
The paper tackles the problem that the Edge of Stability phenomenon, observed in full-batch gradient descent, does not hold for mini-batch SGD, limiting its applicability. It shows that SGD operates in an Edge of Stochastic Stability regime where Batch Sharpness stabilizes at 2/η, suppressing the largest Hessian eigenvalue and aligning with empirical observations that smaller batches and larger step sizes favor flatter minima.
Recent findings by Cohen et al. (2021) demonstrate that when training neural networks with full-batch gradient descent with a step size of $η$, the largest eigenvalue $λ_{\max}$ of the full-batch Hessian consistently stabilizes at $λ_{\max} = 2/η$. These results have significant implications for convergence and generalization. This, however, is not the case for mini-batch stochastic gradient descent (SGD), which limits the broader applicability of those results. We show that SGD trains in a different regime, which we term Edge of Stochastic Stability (EoSS). In this regime, what stabilizes at $2/η$ is *Batch Sharpness*: the expected directional curvature of mini-batch Hessians along their corresponding stochastic gradients. As a consequence, $λ_{\max}$, which is generally smaller than Batch Sharpness, is suppressed, aligning with the long-standing empirical observation that smaller batches and larger step sizes favor flatter minima. We further discuss implications for mathematical modeling of SGD trajectories.
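To make the verbal definition of Batch Sharpness concrete, the following is a minimal sketch that estimates it by Monte Carlo on a toy least-squares problem. Several things here are assumptions for illustration and are not taken from the paper: the function name `batch_sharpness`, the synthetic data, the choice of a quadratic loss $L_B(w) = \frac{1}{2b}\|X_B w - y_B\|^2$, and the reading of "expected directional curvature" as the average over mini-batches $B$ of the Rayleigh quotient $g_B^\top H_B g_B / \|g_B\|^2$, where $g_B$ and $H_B$ are the mini-batch gradient and Hessian. The paper's exact normalization may differ.

```python
import numpy as np

def batch_sharpness(X, y, w, batch_size, n_batches, rng):
    """Monte Carlo estimate of E_B[ g_B^T H_B g_B / ||g_B||^2 ] for least squares.

    Hypothetical helper: assumes L_B(w) = ||X_B w - y_B||^2 / (2 * batch_size),
    so g_B = X_B^T (X_B w - y_B) / b and H_B = X_B^T X_B / b.
    """
    n = X.shape[0]
    quotients = []
    for _ in range(n_batches):
        idx = rng.choice(n, size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        g = Xb.T @ (Xb @ w - yb) / batch_size   # mini-batch gradient
        Hg = Xb.T @ (Xb @ g) / batch_size       # Hessian-vector product H_B g
        quotients.append(g @ Hg / (g @ g))      # curvature along the gradient
    return float(np.mean(quotients))

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 10))
y = rng.normal(size=256)
w = rng.normal(size=10)

# Largest eigenvalue of the full-batch Hessian X^T X / n.
lam_max = float(np.linalg.eigvalsh(X.T @ X / len(X))[-1])
bs = batch_sharpness(X, y, w, batch_size=8, n_batches=500, rng=rng)

print(f"lambda_max (full batch): {lam_max:.3f}")
print(f"Batch Sharpness (b=8):   {bs:.3f}")
```

With `batch_size` equal to the dataset size, the quantity reduces to a single Rayleigh quotient of the full-batch Hessian and is therefore bounded above by $λ_{\max}$. For small batches the two quantities can differ, which is the gap the abstract refers to. For a neural network one would replace the explicit Hessian with a Hessian-vector product from automatic differentiation.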