LGMay 24, 2023

Local SGD Accelerates Convergence by Exploiting Second Order Information of the Loss Function

arXiv:2305.15013v22 citations
Originality Incremental advance
AI Analysis

This provides a theoretical foundation for optimizing distributed machine learning methods like federated learning, though it is incremental as it builds on existing analysis of L-SGD.

The paper tackles the problem of explaining why Local SGD (L-SGD) outperforms SGD in distributed learning by theoretically proving that L-SGD exploits second-order information of the loss function, leading to faster convergence, with experimental validation on two datasets.

With multiple iterations of updates, local statistical gradient descent (L-SGD) has been proven to be very effective in distributed machine learning schemes such as federated learning. In fact, many innovative works have shown that L-SGD with independent and identically distributed (IID) data can even outperform SGD. As a result, extensive efforts have been made to unveil the power of L-SGD. However, existing analysis failed to explain why the multiple local updates with small mini-batches of data (L-SGD) can not be replaced by the update with one big batch of data and a larger learning rate (SGD). In this paper, we offer a new perspective to understand the strength of L-SGD. We theoretically prove that, with IID data, L-SGD can effectively explore the second order information of the loss function. In particular, compared with SGD, the updates of L-SGD have much larger projection on the eigenvectors of the Hessian matrix with small eigenvalues, which leads to faster convergence. Under certain conditions, L-SGD can even approach the Newton method. Experiment results over two popular datasets validate the theoretical results.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes