LG CR DC MLOct 3, 2022

Distributed Non-Convex Optimization with One-Bit Compressors on Heterogeneous Data: Efficient and Resilient Algorithms

arXiv:2210.00665v25.86 citationsh-index: 18

Originality Incremental advance

AI Analysis

This addresses communication bottlenecks and security issues in federated learning for heterogeneous data, offering incremental improvements over existing sign-based methods.

The paper tackles communication-efficient distributed non-convex optimization for federated learning by proposing two algorithms, Ada-StoSign and β-StoSign, that compress gradients into bit vectors; Ada-StoSign achieves a convergence rate of O(log T/√T + 1/√M), outperforming prior sign-based methods with O(T^{-1/4}) when M is large, and β-StoSign provides Byzantine resilience and privacy assurances.

Federated Learning (FL) is a nascent decentralized learning framework under which a massive collection of heterogeneous clients collaboratively train a model without revealing their local data. Scarce communication, privacy leakage, and Byzantine attacks are the key bottlenecks of system scalability. In this paper, we focus on communication-efficient distributed (stochastic) gradient descent for non-convex optimization, a driving force of FL. We propose two algorithms, named {\em Adaptive Stochastic Sign SGD (Ada-StoSign)} and {\em $β$-Stochastic Sign SGD ($β$-StoSign)}, each of which compresses the local gradients into bit vectors. To handle unbounded gradients, Ada-StoSign uses a novel norm tracking function that adaptively adjusts a coarse estimation on the $\ell_{\infty}$ of the local gradients - a key parameter used in gradient compression. We show that Ada-StoSign converges in expectation with a rate $O(\log T/\sqrt{T} + 1/\sqrt{M})$, where $M$ is the number of clients. To the best of our knowledge, when $M$ is sufficiently large, Ada-StoSign outperforms the state-of-the-art sign-based method whose convergence rate is $O(T^{-1/4})$. Under bounded gradient assumption, $β$-StoSign achieves quantifiable Byzantine resilience and privacy assurances, and works with partial client participation and mini-batch gradients which could be unbounded. We corroborate and complement our theories by experiments on MNIST and CIFAR-10 datasets.

View on arXiv PDF

Similar