Minibatch vs Local SGD with Shuffling: Tight Convergence Bounds and Beyond
This work addresses optimization efficiency in distributed learning, particularly federated averaging, by providing tight theoretical guarantees for the shuffling methods used in practice. Its contribution is incremental in that it builds on existing SGD analyses.
The paper analyzes shuffling-based variants of minibatch and local SGD for distributed learning. For smooth functions satisfying the Polyak-Łojasiewicz condition, it shows that these methods converge faster than their with-replacement counterparts, proves matching lower bounds so that the convergence bounds are tight, and proposes an algorithmic modification that achieves faster rates in near-homogeneous settings.
In distributed learning, local SGD (also known as federated averaging) and its simple baseline minibatch SGD are widely studied optimization methods. Most existing analyses of these methods assume independent and unbiased gradient estimates obtained via with-replacement sampling. In contrast, we study shuffling-based variants: minibatch and local Random Reshuffling, which draw stochastic gradients without replacement and are thus closer to practice. For smooth functions satisfying the Polyak-Łojasiewicz condition, we obtain convergence bounds (in the large epoch regime) which show that these shuffling-based variants converge faster than their with-replacement counterparts. Moreover, we prove matching lower bounds showing that our convergence analysis is tight. Finally, we propose an algorithmic modification called synchronized shuffling that leads to convergence rates faster than our lower bounds in near-homogeneous settings.
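To make the two shuffling-based variants concrete, the following is a minimal sketch and not the paper's exact pseudocode. It assumes a toy scalar problem in which every local component is the quadratic 0.5 * (x - a)^2, so the global minimizer is the mean of all data points. The names `make_data`, `minibatch_rr`, `local_rr`, and the parameter `sync_every` (the number of components each machine processes between communication rounds) are hypothetical and chosen for illustration only. Synchronized shuffling is not shown.

```python
import random


def make_data(num_machines, n_local, seed=0):
    """Toy heterogeneous data: machine m draws its points around mean m."""
    rng = random.Random(seed)
    return [[rng.gauss(m, 1.0) for _ in range(n_local)]
            for m in range(num_machines)]


def grad(x, a):
    """Gradient of the assumed component 0.5 * (x - a)**2."""
    return x - a


def minibatch_rr(data, epochs, lr, sync_every, seed=0):
    """Minibatch Random Reshuffling (sketch): every gradient in a round is
    evaluated at the shared iterate, then one averaged step is taken."""
    rng = random.Random(seed)
    n_local = len(data[0])
    x = 0.0
    for _ in range(epochs):
        # Each machine shuffles its own components once per epoch
        # (sampling without replacement).
        perms = [rng.sample(range(n_local), n_local) for _ in data]
        for start in range(0, n_local, sync_every):
            stop = min(start + sync_every, n_local)
            grads = [grad(x, data[m][perms[m][i]])
                     for m in range(len(data))
                     for i in range(start, stop)]
            x -= lr * sum(grads) / len(grads)
    return x


def local_rr(data, epochs, lr, sync_every, seed=0):
    """Local Random Reshuffling (sketch): each machine takes sync_every
    local steps on its own iterate, then the iterates are averaged."""
    rng = random.Random(seed)
    n_local = len(data[0])
    x = 0.0
    for _ in range(epochs):
        perms = [rng.sample(range(n_local), n_local) for _ in data]
        for start in range(0, n_local, sync_every):
            stop = min(start + sync_every, n_local)
            local_iterates = []
            for m in range(len(data)):
                x_m = x  # start from the last synchronized iterate
                for i in range(start, stop):
                    x_m -= lr * grad(x_m, data[m][perms[m][i]])
                local_iterates.append(x_m)
            x = sum(local_iterates) / len(local_iterates)  # communication round
    return x


if __name__ == "__main__":
    data = make_data(num_machines=4, n_local=8, seed=1)
    optimum = sum(sum(d) for d in data) / (4 * 8)
    print("optimum     :", optimum)
    print("minibatch RR:", minibatch_rr(data, epochs=200, lr=0.02, sync_every=2))
    print("local RR    :", local_rr(data, epochs=200, lr=0.02, sync_every=2))
```

In this sketch both methods use the same number of gradient evaluations and communication rounds per epoch. They differ only in whether gradients within a round are evaluated at the shared iterate (minibatch) or along each machine's local trajectory (local). With `sync_every=1` the two coincide.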