OC LGApr 15

Gradient Descent's Last Iterate is Often (slightly) Suboptimal

arXiv:2604.1387052.51 citationsh-index: 7

Predicted impact top 12% in OC · last 90 daysOriginality Highly original

AI Analysis

Resolves a conjecture by Jain et al. on the impossibility of optimal anytime last-iterate convergence for SGD, establishing fundamental limitations for practitioners using standard stepsize schedules.

The paper proves that for convex Lipschitz minimization, no stepsize sequence can achieve the optimal last-iterate convergence rate for SGD without prior knowledge of the time horizon, and even for GD an excess poly-log factor is unavoidable for anytime guarantees.

We consider the well-studied setting of minimizing a convex Lipschitz function using either gradient descent (GD) or its stochastic variant (SGD), and examine the last iterate convergence. By now, it is known that standard stepsize choices lead to a last iterate convergence rate of $\log T/\sqrt{T}$ after $T$ steps. A breakthrough result of Jain et al. [2019] recovered the optimal $1/\sqrt{T}$ rate by constructing a non-standard stepsize sequence. However, this sequence requires choosing $T$ in advance, as opposed to common stepsize schedules which apply for any time horizon. Moreover, Jain et al. conjectured that without prior knowledge of $T$, no stepsize sequence can ensure the optimal error for SGD's last iterate, a claim which so far remained unproven. We prove this conjecture, and in fact show that even in the noiseless case of GD, it is impossible to avoid an excess poly-log factor in $T$ when considering an anytime last iterate guarantee. Our proof further suggests that such (slightly) suboptimal stopping times are unavoidably common.

View on arXiv PDF

Similar