On Provable Benefits of Muon in Federated Learning
The paper addresses the unexplored performance of Muon in federated learning; the contribution is incremental, as it adapts an existing optimizer to a new setting.
This paper tackles the problem of applying the Muon optimizer to federated learning by proposing FedMuon, establishing its convergence rate for nonconvex problems and showing it accommodates heavy-tailed noise, with experiments validating its effectiveness across neural network architectures.
Muon, a recently introduced optimizer, has gained increasing attention due to its superior performance across a wide range of applications. However, its effectiveness in federated learning remains unexplored. To address this gap, this paper investigates the performance of Muon in the federated learning setting. Specifically, we propose a new algorithm, FedMuon, and establish its convergence rate for nonconvex problems. Our theoretical analysis reveals multiple favorable properties of FedMuon. In particular, due to its orthonormalized update direction, the learning rate of FedMuon is independent of problem-specific parameters, and, importantly, it can naturally accommodate heavy-tailed noise. Extensive experiments on a variety of neural network architectures validate the effectiveness of the proposed algorithm.
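To make the phrase "orthonormalized update direction" concrete, the following is a minimal sketch, not the FedMuon algorithm itself, whose details are not given here. It assumes the direction is the orthogonal polar factor U V^T of a gradient (or momentum) matrix, computed exactly with an SVD; practical Muon implementations typically approximate this factor with a Newton-Schulz iteration instead. The names `orthonormalize`, `grad`, `d_small`, and `d_large` are hypothetical and chosen only for illustration.

```python
import numpy as np


def orthonormalize(m: np.ndarray) -> np.ndarray:
    """Return U @ V^T from the thin SVD m = U diag(s) V^T.

    The singular values are discarded, so the result has orthonormal
    columns (for a tall, full-rank m) and spectral norm 1.
    """
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt


rng = np.random.default_rng(0)
grad = rng.standard_normal((4, 3))  # stand-in for a gradient matrix

# The same gradient at two very different magnitudes, mimicking an
# occasional extremely large stochastic gradient.
d_small = orthonormalize(grad)
d_large = orthonormalize(1e6 * grad)

# Both yield the same update direction, and its size is fixed.
same_direction = np.allclose(d_small, d_large)
spectral_norm = np.linalg.norm(d_small, 2)
```

Because the singular values are dropped, the size of the step is set by the learning rate alone and not by the gradient's magnitude. This is one intuition for why such an update can use a learning rate that does not depend on problem-specific constants and is less sensitive to heavy-tailed gradient noise.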