LG | CY | DC | SY | ML | Jun 5, 2023

Decentralized SGD and Average-direction SAM are Asymptotically Equivalent

arXiv:2306.02913v5 | 21 citations | h-index: 35 | Has Code
Originality: Highly original
AI Analysis

This work studies generalization in decentralized machine learning for large-scale collaborative training, and offers a new theoretical perspective rather than an incremental improvement.

The paper challenges the belief that decentralized learning undermines generalization. It proves that decentralized stochastic gradient descent (D-SGD) is asymptotically equivalent to an average-direction sharpness-aware minimization (SAM) algorithm. This equivalence reveals advantages such as improved posterior estimation and gradient smoothing, and suggests a potential generalization benefit over centralized SGD in large-batch scenarios.

Decentralized stochastic gradient descent (D-SGD) allows collaborative learning on massive devices simultaneously without the control of a central server. However, existing theories claim that decentralization invariably undermines generalization. In this paper, we challenge the conventional belief and present a completely new perspective for understanding decentralized learning. We prove that D-SGD implicitly minimizes the loss function of an average-direction Sharpness-aware minimization (SAM) algorithm under general non-convex non-$β$-smooth settings. This surprising asymptotic equivalence reveals an intrinsic regularization-optimization trade-off and three advantages of decentralization: (1) there exists a free uncertainty evaluation mechanism in D-SGD to improve posterior estimation; (2) D-SGD exhibits a gradient smoothing effect; and (3) the sharpness regularization effect of D-SGD does not decrease as total batch size increases, which justifies the potential generalization benefit of D-SGD over centralized SGD (C-SGD) in large-batch scenarios. The code is available at https://github.com/Raiden-Zhu/ICML-2023-DSGD-and-SAM.
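
To make the D-SGD update concrete, here is a minimal NumPy sketch. It assumes the commonly used form of D-SGD, in which each worker averages its neighbours' models through a doubly stochastic gossip matrix and then takes a local gradient step; the abstract does not spell out the update rule, so this form is an assumption. The names `ring_mixing_matrix`, `dsgd_step`, and `toy_grad`, the ring topology, and the toy loss are all hypothetical choices for illustration and are not taken from the authors' repository.

```python
import numpy as np


def ring_mixing_matrix(n):
    """Doubly stochastic gossip matrix for a ring (assumed topology).

    Each worker puts weight 1/3 on itself and on each of its two ring neighbours.
    """
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] += 1.0 / 3.0
        W[i, (i - 1) % n] += 1.0 / 3.0
        W[i, (i + 1) % n] += 1.0 / 3.0
    return W


def toy_grad(x):
    """Gradient of the non-convex toy loss sum(x**4 / 4 - x**2 / 2)."""
    return x ** 3 - x


def dsgd_step(X, W, grad_fn, lr, rng, noise_std=0.0):
    """One D-SGD step on local models X of shape (n_workers, dim).

    Every worker evaluates the gradient at its OWN model X[i], optionally with
    Gaussian noise standing in for minibatch noise, then mixes with neighbours.
    """
    G = np.stack([grad_fn(X[i]) for i in range(X.shape[0])])
    G = G + noise_std * rng.standard_normal(G.shape)
    return W @ X - lr * G


rng = np.random.default_rng(0)
n_workers, dim, lr = 8, 5, 0.1
W = ring_mixing_matrix(n_workers)
X = rng.standard_normal((n_workers, dim))

X_next = dsgd_step(X, W, toy_grad, lr, rng)

# Because W is doubly stochastic, the averaged model moves along the mean of
# gradients taken at the scattered local models, not at the average itself.
mean_local_grad = np.mean([toy_grad(x) for x in X], axis=0)
expected_mean = X.mean(axis=0) - lr * mean_local_grad
```

In this sketch the averaged model is updated with gradients evaluated at points spread around it, which is the intuition behind reading D-SGD as an average-direction SAM-like procedure. The precise equivalence and its conditions are established in the paper itself, not by this toy example.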
