Unified Optimal Analysis of the (Stochastic) Gradient Method
This note provides a unified analysis of gradient methods; the contribution is incremental, since it refines existing proofs rather than introducing new methods.
The paper proves convergence of stochastic gradient descent (SGD) on $μ$-convex functions under a milder-than-standard $L$-smoothness assumption, showing that with carefully chosen stepsizes SGD converges after $T$ iterations at a rate of $O\left( LR^2 \exp \bigl[-\frac{μ}{4L}T\bigr] + \frac{σ^2}{μT} \right)$, which matches the best known iteration complexity up to constants.
In this note we give a simple proof for the convergence of stochastic gradient (SGD) methods on $μ$-convex functions under a (milder than standard) $L$-smoothness assumption. We show that, for carefully chosen stepsizes, SGD converges after $T$ iterations as $O\left( LR^2 \exp \bigl[-\frac{μ}{4L}T\bigr] + \frac{σ^2}{μT} \right)$, where $σ^2$ measures the variance of the stochastic noise. For deterministic gradient descent (GD), and for SGD in the interpolation setting, we have $σ^2 = 0$ and we recover the exponential convergence rate. The bound matches the best known iteration complexity of GD and SGD, up to constants.
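To make the two terms of the bound concrete, the following is a minimal numerical sketch, not the construction analysed in the note. It runs SGD on a hypothetical two-dimensional quadratic whose curvatures are $μ$ and $L$, with Gaussian gradient noise of per-coordinate standard deviation `sigma`. The test function, the noise model, the constant stepsize $1/L$, and the reading of $R^2$ as the squared initial distance to the minimizer are all assumptions made for illustration.

```python
import math
import random


def sgd_quadratic(mu, L, sigma, T, stepsize, x0=(1.0, 1.0), seed=0):
    """Run T steps of SGD on f(x) = 0.5 * (mu * x[0]**2 + L * x[1]**2).

    The minimizer is x* = 0, so the returned value is the squared distance
    to the optimum after T steps. The gradient oracle adds zero-mean
    Gaussian noise with standard deviation `sigma` to each coordinate
    (an assumed noise model, chosen only for illustration).
    """
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    x = list(x0)
    curvatures = (mu, L)
    for _ in range(T):
        # Stochastic gradient: exact gradient plus Gaussian noise.
        g = [c * xi + sigma * rng.gauss(0.0, 1.0) for c, xi in zip(curvatures, x)]
        # Plain gradient step with a constant stepsize.
        x = [xi - stepsize * gi for xi, gi in zip(x, g)]
    return x[0] ** 2 + x[1] ** 2


if __name__ == "__main__":
    mu, L, T = 1.0, 10.0, 100
    # sigma = 0 corresponds to deterministic GD (or the interpolation setting).
    exact = sgd_quadratic(mu, L, sigma=0.0, T=T, stepsize=1.0 / L)
    noisy = sgd_quadratic(mu, L, sigma=1.0, T=T, stepsize=1.0 / L)
    print(f"sigma = 0: squared distance {exact:.3e}")
    print(f"sigma = 1: squared distance {noisy:.3e}")
```

With `sigma = 0` the squared distance shrinks geometrically in $T$, in line with the exponential term. With `sigma > 0` and a constant stepsize the iterates stall at a noise floor instead of continuing to improve, which is why the $\frac{σ^2}{μT}$ term requires the carefully chosen stepsizes mentioned in the abstract; this sketch does not implement them.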