LG OC MLMay 13, 2025

Rapid Overfitting of Multi-Pass Stochastic Gradient Descent in Stochastic Convex Optimization

Shira Vansover-Hager, Tomer Koren, Roi Livni

arXiv:2505.08306v12 citationsh-index: 31

Originality Incremental advance

AI Analysis

This reveals a phase-transition in SGD's out-of-sample behavior, addressing a fundamental gap in understanding for practitioners in machine learning optimization, though it is incremental as it builds on prior work on one-pass SGD.

The paper shows that multi-pass stochastic gradient descent (SGD) can overfit significantly in non-smooth stochastic convex optimization, with population loss as large as Ω(1) after just one additional pass using an optimal step size, and provides bounds on the loss after the first epoch.

We study the out-of-sample performance of multi-pass stochastic gradient descent (SGD) in the fundamental stochastic convex optimization (SCO) model. While one-pass SGD is known to achieve an optimal $Θ(1/\sqrt{n})$ excess population loss given a sample of size $n$, much less is understood about the multi-pass version of the algorithm which is widely used in practice. Somewhat surprisingly, we show that in the general non-smooth case of SCO, just a few epochs of SGD can already hurt its out-of-sample performance significantly and lead to overfitting. In particular, using a step size $η= Θ(1/\sqrt{n})$, which gives the optimal rate after one pass, can lead to population loss as large as $Ω(1)$ after just one additional pass. More generally, we show that the population loss from the second pass onward is of the order $Θ(1/(ηT) + η\sqrt{T})$, where $T$ is the total number of steps. These results reveal a certain phase-transition in the out-of-sample behavior of SGD after the first epoch, as well as a sharp separation between the rates of overfitting in the smooth and non-smooth cases of SCO. Additionally, we extend our results to with-replacement SGD, proving that the same asymptotic bounds hold after $O(n \log n)$ steps. Finally, we also prove a lower bound of $Ω(η\sqrt{n})$ on the generalization gap of one-pass SGD in dimension $d = \smash{\widetilde O}(n)$, improving on recent results of Koren et al.(2022) and Schliserman et al.(2024).

View on arXiv PDF

Similar