LG · Sep 19, 2025

On the Convergence of Muon and Beyond

arXiv:2509.15816v3 · 17 citations · h-index: 2
Originality: Incremental advance
AI Analysis

This work provides foundational theoretical insights for optimizing neural networks, addressing a bottleneck in understanding Muon-style methods, though it is incremental in extending existing frameworks.

The paper tackles the gap between the practical performance and the theoretical understanding of the Muon optimizer by analyzing momentum-based variance-reduced variants. It proves that Muon-MVR2 achieves the optimal iteration complexity of $\tilde{\mathcal{O}}(T^{-1/3})$ and provides last-iterate convergence guarantees under the Polyak-Łojasiewicz condition.
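For orientation, the two notions in this summary can be written out in their standard textbook form. This is a sketch under assumptions: the Frobenius norm, the constant $\mu > 0$, and the stationarity measure below are the usual choices, and the paper's exact norm and constants may differ.

$$
\tfrac{1}{2}\,\|\nabla f(X)\|_F^2 \;\ge\; \mu\,\bigl(f(X) - f^\star\bigr) \quad \text{for all } X \qquad \text{(Polyak-Łojasiewicz condition)}
$$

$$
\min_{1 \le t \le T} \mathbb{E}\,\|\nabla f(X_t)\| \le C\,T^{-1/3} \le \epsilon
\;\Longleftrightarrow\; T \ge (C/\epsilon)^{3},
\qquad
C\,T^{-1/4} \le \epsilon
\;\Longleftrightarrow\; T \ge (C/\epsilon)^{4}.
$$

Read this way, moving from a $T^{-1/4}$ rate to a $T^{-1/3}$ rate lowers the number of iterations needed for an $\epsilon$-stationary point from order $\epsilon^{-4}$ to order $\epsilon^{-3}$, ignoring logarithmic factors.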

The Muon optimizer has demonstrated remarkable empirical success in handling matrix-structured parameters for training neural networks. However, a significant gap remains between its practical performance and theoretical understanding. Existing analyses show that the Muon variants achieve only a suboptimal iteration complexity of $\mathcal{O}(T^{-1/4})$ in stochastic non-convex settings, where $T$ denotes the number of iterations. To explore the theoretical limits of the Muon framework, we analyze two Momentum-based Variance-Reduced variants: a one-batch version (Muon-MVR1) and a two-batch version (Muon-MVR2). We provide the first rigorous proof that incorporating variance reduction enables Muon-MVR2 to attain the optimal iteration complexity of $\tilde{\mathcal{O}}(T^{-1/3})$, thereby matching the theoretical lower bound for this class of problems. Furthermore, our analysis establishes last-iterate convergence guarantees for Muon variants under the Polyak-Łojasiewicz (PŁ) condition. Extensive experiments on vision (CIFAR-10) and language (C4) benchmarks corroborate our theoretical findings on per-iteration convergence. Overall, this work offers the first proof of optimality for a Muon-style optimizer and clarifies the path toward developing more practically efficient, accelerated variants.
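The idea of combining momentum-based variance reduction with an orthogonalized update can be illustrated with a minimal NumPy sketch. It is not the paper's algorithm: it assumes a standard STORM-style recursion with both gradients evaluated on the same batch, swaps in an exact SVD for the Newton-Schulz orthogonalization that Muon implementations typically use, and runs on a toy quadratic. The names `orthogonalize`, `mvr_muon_step`, `run_demo`, and the values of `lr`, `alpha`, and `noise` are all hypothetical.

```python
import numpy as np


def orthogonalize(m):
    """Return the polar factor U V^T of m using an exact SVD.

    Assumption: a stand-in for the approximate Newton-Schulz
    orthogonalization that Muon implementations typically use.
    """
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt


def mvr_muon_step(x, x_prev, m_prev, grad_fn, batch, lr=0.05, alpha=0.1):
    """One hypothetical variance-reduced, orthogonalized step.

    Both gradients are evaluated on the SAME batch, so the difference
    g_now - g_old tracks how the gradient moved between iterates.
    """
    g_now = grad_fn(x, batch)
    g_old = grad_fn(x_prev, batch)
    m = g_now + (1.0 - alpha) * (m_prev - g_old)  # STORM-style recursion
    x_next = x - lr * orthogonalize(m)
    return x_next, m


def run_demo(steps=300, seed=0, noise=0.1):
    """Minimize the toy objective 0.5 * ||X - target||_F^2 with noisy gradients."""
    rng = np.random.default_rng(seed)
    target = rng.standard_normal((4, 3))

    def loss(x):
        return 0.5 * np.sum((x - target) ** 2)

    def grad_fn(x, batch):
        # Here "batch" is simply the additive noise drawn for this step.
        return (x - target) + batch

    x_prev = np.zeros((4, 3))
    x = x_prev.copy()
    m = grad_fn(x, noise * rng.standard_normal((4, 3)))  # initial estimate
    initial_loss = loss(x)
    for _ in range(steps):
        batch = noise * rng.standard_normal((4, 3))
        x_next, m = mvr_muon_step(x, x_prev, m, grad_fn, batch)
        x_prev, x = x, x_next
    return initial_loss, loss(x)


if __name__ == "__main__":
    start, end = run_demo()
    print(f"loss: {start:.4f} -> {end:.4f}")
```

Evaluating the gradient at both the current and the previous iterate on one batch costs two gradient computations per step, which is the usual price of this kind of estimator. Whether this corresponds to the paper's two-batch variant is an assumption here, not something the abstract states.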

