Optimizing Stochastic Gradient Push under Broadcast Communications

arXiv:2604.15549

AI Analysis

For practitioners of decentralized federated learning in wireless networks, this work reduces convergence time by designing mixing matrices for directed communication graphs, which gives more design flexibility than existing approaches restricted to symmetric mixing matrices.

This paper addresses minimizing convergence time for decentralized federated learning under broadcast communications by designing mixing matrices for stochastic gradient push (SGP), which allows directed graphs. The proposed algorithm reduces convergence time notably compared to state-of-the-art methods without compromising model quality.

We consider the problem of minimizing the convergence time for decentralized federated learning (DFL) in wireless networks under broadcast communications, with a focus on mixing matrix design. The mixing matrix is a critical hyperparameter for DFL that simultaneously controls the convergence rate across iterations and the communication demand per iteration, both of which strongly influence the convergence time. Although the problem has been studied previously, existing solutions are mostly designed for decentralized parallel stochastic gradient descent (D-PSGD), which requires the mixing matrix to be symmetric and doubly stochastic. These constraints confine the activated communication graph to undirected (i.e., bidirected) graphs, which limits design flexibility. In contrast, we consider mixing matrix design for stochastic gradient push (SGP), which allows asymmetric mixing matrices and hence directed communication graphs. By analyzing how the convergence rate of SGP depends on the mixing matrices, we extract an objective function that explicitly depends on graph-theoretic parameters of the activated communication graph, based on which we develop an efficient design algorithm with performance guarantees. Our evaluations based on real data show that the proposed solution can notably reduce the convergence time compared to the state of the art without compromising the quality of the trained model.
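
To make the contrast between symmetric and asymmetric mixing matrices concrete, the following is a minimal sketch of push-sum averaging, the consensus step that SGP is generally understood to build on. It is not the paper's design algorithm: the 3-node directed graph, the weights in `P`, and the function names are all assumptions chosen for illustration, and the local gradient steps that SGP interleaves with mixing are left out. The sketch only shows that a column-stochastic matrix that is neither symmetric nor doubly stochastic can still drive every node to the network-wide average once each value is divided by its push-sum weight.

```python
# Push-sum averaging over a directed graph: the consensus step that SGP builds on.
# The 3-node graph, its weights, and all function names are illustrative assumptions.

def is_column_stochastic(P, tol=1e-12):
    """True if every column of P sums to 1 (each sender splits its mass fully)."""
    n = len(P)
    return all(abs(sum(P[i][j] for i in range(n)) - 1.0) <= tol for j in range(n))


def is_symmetric(P, tol=1e-12):
    """True if P[i][j] equals P[j][i] for all i, j, as D-PSGD would require."""
    n = len(P)
    return all(abs(P[i][j] - P[j][i]) <= tol for i in range(n) for j in range(n))


def push_sum_step(P, x, w):
    """One round: node j pushes P[i][j] * x[j] and P[i][j] * w[j] to node i."""
    n = len(x)
    new_x = [sum(P[i][j] * x[j] for j in range(n)) for i in range(n)]
    new_w = [sum(P[i][j] * w[j] for j in range(n)) for i in range(n)]
    return new_x, new_w


# Directed links: 0 -> 1, 1 -> 2, 2 -> 0, 2 -> 1, plus a self-loop at every node.
# Column j holds the weights node j assigns to itself and its out-neighbors.
P = [
    [1 / 2, 0.0, 1 / 3],
    [1 / 2, 1 / 2, 1 / 3],
    [0.0, 1 / 2, 1 / 3],
]

x = [3.0, 6.0, 9.0]  # local values (stand-ins for model parameters)
w = [1.0, 1.0, 1.0]  # push-sum weights, initialized to 1
target = sum(x) / len(x)

for _ in range(100):
    x, w = push_sum_step(P, x, w)

# De-biased estimates x_i / w_i approach the average of the initial values.
estimates = [xi / wi for xi, wi in zip(x, w)]
print(estimates, target)
```

Here the raw values `x` do not agree across nodes, because the rows of `P` do not sum to 1; it is the ratio `x[i] / w[i]` that converges. That extra weight sequence is what frees the mixing matrix from the symmetry and double stochasticity that D-PSGD requires.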
