OCLGApr 15

Gradient Descent's Last Iterate is Often (slightly) Suboptimal

arXiv:2604.1387052.51 citationsh-index: 7
Predicted impact top 12% in OC · last 90 daysOriginality Highly original
AI Analysis

Resolves a conjecture by Jain et al. on the impossibility of optimal anytime last-iterate convergence for SGD, establishing fundamental limitations for practitioners using standard stepsize schedules.

The paper proves that for convex Lipschitz minimization, no stepsize sequence can achieve the optimal last-iterate convergence rate for SGD without prior knowledge of the time horizon, and even for GD an excess poly-log factor is unavoidable for anytime guarantees.

We consider the well-studied setting of minimizing a convex Lipschitz function using either gradient descent (GD) or its stochastic variant (SGD), and examine the last iterate convergence. By now, it is known that standard stepsize choices lead to a last iterate convergence rate of $\log T/\sqrt{T}$ after $T$ steps. A breakthrough result of Jain et al. [2019] recovered the optimal $1/\sqrt{T}$ rate by constructing a non-standard stepsize sequence. However, this sequence requires choosing $T$ in advance, as opposed to common stepsize schedules which apply for any time horizon. Moreover, Jain et al. conjectured that without prior knowledge of $T$, no stepsize sequence can ensure the optimal error for SGD's last iterate, a claim which so far remained unproven. We prove this conjecture, and in fact show that even in the noiseless case of GD, it is impossible to avoid an excess poly-log factor in $T$ when considering an anytime last iterate guarantee. Our proof further suggests that such (slightly) suboptimal stopping times are unavoidably common.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes