LG · OC · ML · Jul 31, 2019

How Good is SGD with Random Shuffling?

arXiv:1908.00045v4 22.794 citations

Originality: Incremental advance

AI Analysis

This work fills a theoretical gap around the shuffling heuristics that machine learning practitioners commonly use with SGD. It offers foundational insight into how shuffling strategies compare, though the advance is incremental relative to recent work.

The paper studies stochastic gradient descent (SGD) with random shuffling on smooth, strongly convex finite-sum problems. It proves lower bounds on the expected optimization error after k passes over n functions: Ω(1/(nk)^2 + 1/(nk^3)) when the functions are reshuffled after every pass, and Ω(1/(nk^2)) when they are shuffled only once. Together with known upper bounds for repeated reshuffling, these bounds establish a gap between the two strategies.
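
To get a feel for how the two rates quoted above scale, the sketch below evaluates both expressions with all constants dropped. The function names and the values n = 100, k = 10 are illustrative assumptions, not taken from the paper, and since the bounds are asymptotic the numbers indicate scaling only, not actual error values.

```python
def reshuffle_rate(n: int, k: int) -> float:
    """Expression 1/(nk)^2 + 1/(n k^3) from the repeated-reshuffling bound (constants dropped)."""
    return 1.0 / (n * k) ** 2 + 1.0 / (n * k ** 3)


def single_shuffle_rate(n: int, k: int) -> float:
    """Expression 1/(n k^2) from the single-shuffling bound (constants dropped)."""
    return 1.0 / (n * k ** 2)


# Hypothetical problem size: n = 100 functions, k = 10 passes.
n, k = 100, 10
print("repeated reshuffling:", reshuffle_rate(n, k))
print("single shuffling:    ", single_shuffle_rate(n, k))
# Algebraically the ratio simplifies to n*k / (n + k).
print("ratio:               ", single_shuffle_rate(n, k) / reshuffle_rate(n, k))
```

Because the ratio of the two expressions simplifies to nk/(n + k), the single-shuffling expression exceeds the reshuffling one by a factor that grows with both n and k.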

We study the performance of stochastic gradient descent (SGD) on smooth and strongly-convex finite-sum optimization problems. In contrast to the majority of existing theoretical works, which assume that individual functions are sampled with replacement, we focus here on popular but poorly-understood heuristics, which involve going over random permutations of the individual functions. This setting has been investigated in several recent works, but the optimal error rates remain unclear. In this paper, we provide lower bounds on the expected optimization error with these heuristics (using SGD with any constant step size), which elucidate their advantages and disadvantages. In particular, we prove that after $k$ passes over $n$ individual functions, if the functions are re-shuffled after every pass, the best possible optimization error for SGD is at least $\Omega\left(1/(nk)^2+1/(nk^3)\right)$, which partially corresponds to recently derived upper bounds. Moreover, if the functions are only shuffled once, then the lower bound increases to $\Omega(1/(nk^2))$. Since there are strictly smaller upper bounds for repeated reshuffling, this proves an inherent performance gap between SGD with single shuffling and repeated shuffling. As a more minor contribution, we also provide a non-asymptotic $\Omega(1/k^2)$ lower bound (independent of $n$) for the incremental gradient method, when no random shuffling takes place. Finally, we provide an indication that our lower bounds are tight, by proving matching upper bounds for univariate quadratic functions.
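
The orderings the abstract distinguishes (repeated reshuffling, single shuffling, a fixed incremental order, and sampling with replacement) can be sketched on a toy problem. This is a minimal illustration, not the paper's construction: the objective F(x) = (1/n) Σ ½(x − c_i)², the helper names `epoch_orders` and `run_sgd`, the step size 0.05, and the random centers are all assumptions made for the example.

```python
import random


def epoch_orders(n, k, scheme, rng):
    """Yield the index order used in each of k passes over n functions.

    scheme: 'reshuffle'   -> fresh random permutation every pass
            'single'      -> one random permutation, reused every pass
            'incremental' -> fixed order 0..n-1, no shuffling
            'replacement' -> n i.i.d. uniform draws per pass
    """
    base = list(range(n))
    if scheme == "single":
        rng.shuffle(base)  # shuffled once, before the first pass
    for _ in range(k):
        if scheme == "reshuffle":
            order = base[:]
            rng.shuffle(order)
        elif scheme == "replacement":
            order = [rng.randrange(n) for _ in range(n)]
        elif scheme in ("single", "incremental"):
            order = base[:]
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        yield order


def run_sgd(centers, k, step, scheme, seed=0):
    """Constant-step SGD on F(x) = mean_i 0.5*(x - c_i)^2.

    Returns the optimization error F(x_final) - F(x*), which for this
    objective equals 0.5 * (x_final - mean(centers))^2.
    """
    rng = random.Random(seed)
    n = len(centers)
    x_star = sum(centers) / n  # minimizer of F
    x = 0.0                    # arbitrary starting point
    for order in epoch_orders(n, k, scheme, rng):
        for i in order:
            x -= step * (x - centers[i])  # gradient step on f_i
    return 0.5 * (x - x_star) ** 2


# Hypothetical experiment: average the error over a few seeds per scheme.
data_rng = random.Random(123)
centers = [data_rng.gauss(0.0, 1.0) for _ in range(50)]
for scheme in ("reshuffle", "single", "incremental", "replacement"):
    errors = [run_sgd(centers, k=20, step=0.05, scheme=scheme, seed=s) for s in range(20)]
    print(scheme, sum(errors) / len(errors))
```

A single run on one toy instance does not verify the bounds, which concern worst-case problems and expected error; the sketch only makes the difference between the sampling schemes concrete.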
