Muon Converges under Heavy-Tailed Noise: Nonconvex Hölder-Smooth Empirical Risk Minimization
This work addresses the stability and efficiency of large-scale deep learning under realistic noise conditions, but its contribution is incremental because it builds on the existing Muon optimizer.
The paper tackles the problem of training deep neural networks under heavy-tailed stochastic noise by analyzing Muon, an optimizer that enforces orthogonality in its updates. It shows that Muon converges to a stationary point under heavy-tailed noise and that it converges faster than mini-batch SGD.
Muon is a recently proposed optimizer that enforces orthogonality in parameter updates by projecting gradients onto the Stiefel manifold, leading to stable and efficient training of large-scale deep neural networks. Meanwhile, previously reported results indicate that stochastic noise in practical machine learning may exhibit heavy-tailed behavior, violating the bounded-variance assumption. In this paper, we consider the problem of minimizing a nonconvex Hölder-smooth empirical risk, a setting that is well suited to heavy-tailed stochastic noise. We then show that Muon converges to a stationary point of the empirical risk under a boundedness condition that accounts for heavy-tailed stochastic noise. In addition, we show that Muon converges faster than mini-batch SGD.
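To make the orthogonalized update concrete, the following is a minimal sketch of one Muon-style step on a single weight matrix. It is an illustration under stated assumptions, not the paper's algorithm: the function names (`orthogonalize`, `muon_step`), the learning rate, the momentum coefficient, and the use of the classical cubic Newton-Schulz iteration to approximate the orthogonal factor are all choices made here for readability, and practical Muon implementations may use a different iteration and momentum variant.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def orthogonalize(g, steps=30):
    """Approximate the orthogonal polar factor U V^T of g = U S V^T.

    Assumption: uses the classical cubic Newton-Schulz iteration
    X <- 1.5 X - 0.5 X X^T X, chosen here only for illustration.
    """
    fro = math.sqrt(sum(v * v for row in g for v in row))
    if fro == 0.0:
        return [row[:] for row in g]
    # Dividing by the Frobenius norm puts every singular value in [0, 1],
    # which lies inside the convergence region of the iteration.
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(rx, rc)] for rx, rc in zip(x, xxtx)]
    return x


def muon_step(w, grad, momentum, lr=0.02, beta=0.9):
    """One hypothetical Muon-style step: accumulate momentum,
    orthogonalize it, then move the weights along that direction."""
    momentum = [[beta * m + g for m, g in zip(rm, rg)] for rm, rg in zip(momentum, grad)]
    direction = orthogonalize(momentum)
    w = [[wi - lr * d for wi, d in zip(rw, rd)] for rw, rd in zip(w, direction)]
    return w, momentum


# Usage: the direction depends on the gradient's singular vectors, not its scale.
grad = [[3.0, 1.0], [1.0, 2.0]]
huge = [[1000.0 * v for v in row] for row in grad]
d_small = orthogonalize(grad)
d_huge = orthogonalize(huge)
```

Because the orthogonal factor is unchanged when the input is multiplied by a positive constant, `d_small` and `d_huge` agree up to floating-point error. This may offer some intuition for why an orthogonalized update is less sensitive to occasional very large gradient samples, although it is only an illustration and not the argument made in the paper. Note also that the cubic iteration converges slowly when the matrix has very small singular values relative to its largest one, so `steps` would need to grow for ill-conditioned inputs.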