ML IT LG OC

May 29, 2025

On the Convergence Analysis of Muon

Wei Shen, Ruichuan Huang, Minhui Huang, Cong Shen, Jiawei Zhang

arXiv:2505.23737v136.323 citations

h-index: 2

Originality: Incremental advance

AI Analysis

This work addresses a theoretical gap for researchers and practitioners interested in optimization methods for neural networks, though it is incremental as it builds on existing empirical evidence for Muon.

The paper addresses the limited theoretical understanding of Muon, an optimizer designed for matrix-structured parameters in neural networks. It provides a convergence rate analysis, compares Muon with Gradient Descent, and characterizes the conditions under which Muon can outperform it, namely when the Hessian has low-rank and approximately blockwise diagonal structure.

The majority of parameters in neural networks are naturally represented as matrices. However, most commonly used optimizers treat these matrix parameters as flattened vectors during optimization, potentially overlooking their inherent structural properties. Recently, an optimizer called Muon has been proposed, specifically designed to optimize matrix-structured parameters. Extensive empirical evidence shows that Muon can significantly outperform traditional optimizers when training neural networks. Nonetheless, the theoretical understanding of Muon's convergence behavior and the reasons behind its superior performance remain limited. In this work, we present a comprehensive convergence rate analysis of Muon and its comparison with Gradient Descent (GD). We further characterize the conditions under which Muon can outperform GD. Our theoretical results reveal that Muon can benefit from the low-rank and approximate blockwise diagonal structure of Hessian matrices -- phenomena widely observed in practical neural network training. Our experimental results support and corroborate the theoretical findings.
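The abstract does not spell out Muon's update rule. As a rough illustration of what "optimizing a matrix-structured parameter" can look like, the sketch below assumes the commonly described form of Muon: accumulate momentum on the gradient matrix, replace it with an approximately orthogonal matrix, and step in that direction. The function names (`orthogonalize`, `muon_step`), the hyperparameter values, the momentum form, and the use of the classical cubic Newton-Schulz iteration are all assumptions made for this example, not details taken from the paper.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def orthogonalize(g, steps=30):
    """Approximate the polar factor U V^T of g = U S V^T.

    Uses the classical cubic Newton-Schulz iteration
    X <- 1.5 X - 0.5 X X^T X (an assumption for this sketch).
    """
    fro = math.sqrt(sum(v * v for row in g for v in row))
    if fro == 0.0:
        return [row[:] for row in g]
    # Scale by the Frobenius norm so every singular value is at most 1,
    # which keeps the iteration inside its convergence region.
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)]
             for r1, r2 in zip(x, xxtx)]
    return x


def muon_step(w, grad, momentum, lr=0.1, beta=0.9):
    """One hypothetical Muon-style step on a matrix parameter w."""
    # Heavy-ball momentum on the raw gradient matrix (assumed form).
    momentum = [[beta * m + g for m, g in zip(mr, gr)]
                for mr, gr in zip(momentum, grad)]
    # Step along the orthogonalized momentum instead of the momentum itself.
    direction = orthogonalize(momentum)
    w = [[p - lr * d for p, d in zip(pr, dr)] for pr, dr in zip(w, direction)]
    return w, momentum


def gd_step(w, grad, lr=0.1):
    """Plain Gradient Descent step, shown for comparison."""
    return [[p - lr * g for p, g in zip(pr, gr)] for pr, gr in zip(w, grad)]
```

Unlike the Gradient Descent step, the orthogonalized direction depends on the singular vectors of the momentum matrix rather than on its individual entries, so it uses the matrix structure that flattening the parameter into a vector would discard.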
