Minibatch and Local SGD: Algorithmic Stability and Linear Speedup in Generalization
This work addresses the learnability of parallel optimization methods for large-scale data, providing theoretical insights into generalization, though it appears incremental as it builds on existing stability analysis.
The paper tackles the problem of understanding the stability and generalization of minibatch and local SGD for parallel optimization, showing that these methods achieve a linear speedup to attain optimal risk bounds.
The increasing scale of data propels the popularity of leveraging parallelism to speed up the optimization. Minibatch stochastic gradient descent (minibatch SGD) and local SGD are two popular methods for parallel optimization. The existing theoretical studies show a linear speedup of these methods with respect to the number of machines, which, however, is measured by optimization errors in a multi-pass setting. As a comparison, the stability and generalization of these methods are much less studied. In this paper, we study the stability and generalization analysis of minibatch and local SGD to understand their learnability by introducing an expectation-variance decomposition. We incorporate training errors into the stability analysis, which shows how small training errors help generalization for overparameterized models. We show minibatch and local SGD achieve a linear speedup to attain the optimal risk bounds.