How To Make the Gradients Small Stochastically: Even Faster Convex and Nonconvex SGD
This work provides faster convergence rates for SGD in optimization, which is incremental but important for machine learning practitioners dealing with large-scale stochastic problems.
The paper tackles the problem of making gradients small in stochastic gradient descent (SGD) for convex and nonconvex objectives, achieving a near-optimal rate of ̃O(ε^{-2}) for convex cases and ̃O(ε^{-3.5}) for nonconvex cases, improving upon previous rates of O(ε^{-8/3}) and ̃O(ε^{-4}) respectively.
Stochastic gradient descent (SGD) gives an optimal convergence rate when minimizing convex stochastic objectives $f(x)$. However, in terms of making the gradients small, the original SGD does not give an optimal rate, even when $f(x)$ is convex. If $f(x)$ is convex, to find a point with gradient norm $\varepsilon$, we design an algorithm SGD3 with a near-optimal rate $\tilde{O}(\varepsilon^{-2})$, improving the best known rate $O(\varepsilon^{-8/3})$ of [18]. If $f(x)$ is nonconvex, to find its $\varepsilon$-approximate local minimum, we design an algorithm SGD5 with rate $\tilde{O}(\varepsilon^{-3.5})$, where previously SGD variants only achieve $\tilde{O}(\varepsilon^{-4})$ [6, 15, 33]. This is no slower than the best known stochastic version of Newton's method in all parameter regimes [30].