Is Local SGD Better than Minibatch SGD?
This work addresses a gap in theoretical foundations for distributed optimization methods, providing insights for researchers and practitioners in machine learning and federated learning, though it is incremental in clarifying existing methods.
The paper tackles the problem of comparing local SGD to minibatch SGD in distributed optimization, finding that for quadratic objectives, local SGD strictly dominates minibatch SGD and is minimax optimal, while for general convex objectives, it sometimes improves but does not always dominate, with a lower bound showing worse performance in some cases.
We study local SGD (also known as parallel SGD and federated averaging), a natural and frequently used stochastic distributed optimization method. Its theoretical foundations are currently lacking and we highlight how all existing error guarantees in the convex setting are dominated by a simple baseline, minibatch SGD. (1) For quadratic objectives we prove that local SGD strictly dominates minibatch SGD and that accelerated local SGD is minimax optimal for quadratics; (2) For general convex objectives we provide the first guarantee that at least sometimes improves over minibatch SGD; (3) We show that indeed local SGD does not dominate minibatch SGD by presenting a lower bound on the performance of local SGD that is worse than the minibatch SGD guarantee.