LGMay 24, 2023

Local SGD Accelerates Convergence by Exploiting Second Order Information of the Loss Function

arXiv:2305.15013v23.82 citations

Originality Incremental advance

AI Analysis

This provides a theoretical foundation for optimizing distributed machine learning methods like federated learning, though it is incremental as it builds on existing analysis of L-SGD.

The paper tackles the problem of explaining why Local SGD (L-SGD) outperforms SGD in distributed learning by theoretically proving that L-SGD exploits second-order information of the loss function, leading to faster convergence, with experimental validation on two datasets.

With multiple iterations of updates, local statistical gradient descent (L-SGD) has been proven to be very effective in distributed machine learning schemes such as federated learning. In fact, many innovative works have shown that L-SGD with independent and identically distributed (IID) data can even outperform SGD. As a result, extensive efforts have been made to unveil the power of L-SGD. However, existing analysis failed to explain why the multiple local updates with small mini-batches of data (L-SGD) can not be replaced by the update with one big batch of data and a larger learning rate (SGD). In this paper, we offer a new perspective to understand the strength of L-SGD. We theoretically prove that, with IID data, L-SGD can effectively explore the second order information of the loss function. In particular, compared with SGD, the updates of L-SGD have much larger projection on the eigenvectors of the Hessian matrix with small eigenvalues, which leads to faster convergence. Under certain conditions, L-SGD can even approach the Newton method. Experiment results over two popular datasets validate the theoretical results.

View on arXiv PDF

Similar