Muon Does Not Converge on Convex Lipschitz Functions

Tetiana Parshakova, Ahmed Khaled, Michael Crawshaw, Guillaume Garrigos, Robert M. Gower

arXiv:2605.08980

AI Analysis

For deep learning practitioners using Muon, the paper shows that convex Lipschitz theory is inadequate for explaining Muon's empirical success, redirecting focus to smoothness-based analyses.

Muon does not converge on convex Lipschitz functions regardless of learning rate schedule; error feedback restores convergence but degrades performance on CIFAR-10 and nanoGPT, suggesting Muon's success relies on smoothness rather than convex Lipschitz structure.

Muon and its variants have shown strong empirical performance in a variety of deep learning tasks. Existing convergence analyses of Muon rely on smoothness assumptions, though arguably the most successful function class for developing deep learning methods (such as AdaGrad, Shampoo, Schedule-Free, and more) has been the class of convex and Lipschitz functions. In this paper we question whether the classical convex Lipschitz model is a useful one for understanding Muon. Our answer is no. We show that Muon does not converge on the class of convex and Lipschitz functions, regardless of the choice of learning rate schedule. We also show that error feedback restores convergence of Muon and of all non-Euclidean subgradient methods with momentum. However, this theoretical fix using error feedback degrades the performance of Muon in two representative settings: image classification (CIFAR-10) and language modeling (nanoGPT on FineWeb-Edu 10B). Our conclusion is that convex Lipschitz theory, despite having a prominent role in the design of practical methods for deep learning, is not the best suited to Muon. This suggests that Muon's success must come from structure absent from this model, most plausibly related to smoothness.
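
For readers unfamiliar with the update the abstract refers to, the following is a minimal sketch of a basic Muon-style step: accumulate momentum, then move along the orthogonalized momentum matrix. It is an illustration under stated assumptions, not the paper's exact algorithm or its counterexample. The function names (`orthogonalize`, `muon_step`) and the default values of `lr` and `beta` are hypothetical, and an exact SVD stands in for the Newton-Schulz iteration that practical implementations typically use.

```python
import numpy as np


def orthogonalize(m):
    """Return U @ V^T from the reduced SVD of m.

    Assumption: exact SVD is used here for clarity; practical Muon
    implementations usually approximate this with Newton-Schulz iterations.
    """
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt


def muon_step(w, grad, buf, lr=0.02, beta=0.95):
    """One basic Muon-style step (hypothetical helper, no Nesterov, no weight decay)."""
    buf = beta * buf + grad              # heavy-ball momentum accumulation
    w = w - lr * orthogonalize(buf)      # step along the orthogonalized momentum
    return w, buf


rng = np.random.default_rng(0)
w = rng.standard_normal((4, 3))
buf = np.zeros_like(w)
grad = rng.standard_normal((4, 3))       # stand-in for a (sub)gradient

w_new, buf = muon_step(w, grad, buf)
step = w_new - w
singular_values = np.linalg.svd(step, compute_uv=False)
print(singular_values)                   # every singular value equals lr
```

In this sketch the size of the step depends only on `lr`, not on the magnitude of the gradient: every singular value of the step equals `lr` whenever the momentum matrix has full rank. That is a general property of the orthogonalized update shown here and is offered only as background for the abstract's discussion of learning rate schedules.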
