OC · LG · Dec 18, 2025

Muon is Provably Faster with Momentum Variance Reduction

arXiv:2512.16598v1 · 8 citations · h-index: 2
Originality: Incremental advance
AI Analysis

This work provides incremental improvements to optimization methods for training large language models, potentially enhancing efficiency in deep learning applications.

The paper improves LMO-based deep learning optimizers such as Muon and Scion by incorporating momentum variance reduction (MVR) into the Gluon framework, which provably improves the convergence rate from O(1/K^{1/4}) to O(1/K^{1/3}) in the non-convex case.
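As a rough reading of what the rate change means: assuming both bounds control the same stationarity measure (for example an expected gradient norm, which is an assumption here and not stated in this summary) and ignoring constants, the number of iterations $K$ needed to reach accuracy $\varepsilon$ works out as follows.

```latex
\frac{1}{K^{1/4}} \le \varepsilon \iff K \ge \varepsilon^{-4},
\qquad
\frac{1}{K^{1/3}} \le \varepsilon \iff K \ge \varepsilon^{-3}.
```

For $\varepsilon = 10^{-2}$ this is on the order of $10^{8}$ versus $10^{6}$ iterations, up to constants.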

Recent empirical research has demonstrated that deep learning optimizers based on the linear minimization oracle (LMO) over specifically chosen Non-Euclidean norm balls, such as Muon and Scion, outperform Adam-type methods in the training of large language models. In this work, we show that such optimizers can be provably improved by replacing their vanilla momentum by momentum variance reduction (MVR). Instead of proposing and analyzing MVR variants of Muon and Scion separately, we incorporate MVR into the recently proposed Gluon framework, which captures Muon, Scion and other specific Non-Euclidean LMO-based methods as special cases, and at the same time works with a more general smoothness assumption which better captures the layer-wise structure of neural networks. In the non-convex case, we incorporate MVR into Gluon in three different ways. All of them improve the convergence rate from ${\cal O} (\frac{1}{K^{1/4}})$ to ${\cal O} (\frac{1}{K^{1/3}})$. Additionally, we provide improved rates in the star-convex case. Finally, we conduct several numerical experiments that verify the superior performance of our proposed algorithms in terms of iteration complexity.
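The two ingredients named in the abstract, an LMO step over a non-Euclidean norm ball and an MVR momentum estimator, can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the function names `lmo_spectral`, `mvr_momentum` and `run_demo`, the toy quadratic objective, the constant `radius` and `beta`, and the STORM-style form of the estimator, which is one common way to write MVR and may not match any of the paper's three variants. The LMO uses an exact SVD for clarity; practical Muon implementations approximate the orthogonalization with Newton-Schulz iterations instead.

```python
import numpy as np


def lmo_spectral(m, radius):
    """Return argmin over {D : ||D||_2 <= radius} of <m, D> (spectral-norm ball)."""
    # For m = U diag(s) V^T the minimizer is -radius * U V^T.
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return -radius * (u @ vt)


def mvr_momentum(m_prev, grad_new, grad_old, beta):
    """STORM-style MVR estimator (assumed form, see lead-in).

    grad_new and grad_old must be stochastic gradients at the current and
    previous iterate, evaluated on the SAME sample.
    """
    return grad_new + (1.0 - beta) * (m_prev - grad_old)


def run_demo(steps=200, radius=0.05, beta=0.1, noise_std=0.1, seed=0):
    """Toy run on f(X) = 0.5 * ||X - A||_F^2 with additive gradient noise."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((4, 3))  # hypothetical target matrix

    def loss(x):
        return 0.5 * np.sum((x - a) ** 2)

    def stochastic_grad(x, noise):
        # Exact gradient (x - a) plus the noise of one "sample".
        return (x - a) + noise

    x_prev = np.zeros_like(a)
    # Initialize the momentum with a plain stochastic gradient.
    m = stochastic_grad(x_prev, noise_std * rng.standard_normal(a.shape))
    x = x_prev + lmo_spectral(m, radius)
    for _ in range(steps):
        # One shared sample, used at both the current and previous iterate.
        noise = noise_std * rng.standard_normal(a.shape)
        g_new = stochastic_grad(x, noise)
        g_old = stochastic_grad(x_prev, noise)
        m = mvr_momentum(m, g_new, g_old, beta)
        x_prev, x = x, x + lmo_spectral(m, radius)
    return loss(np.zeros_like(a)), loss(x)


initial_loss, final_loss = run_demo()
print(f"loss: {initial_loss:.3f} -> {final_loss:.3f}")
```

The point of evaluating both gradients on the same sample is that their noise largely cancels in the correction term `m_prev - grad_old`, which is what distinguishes MVR from vanilla momentum, where only `grad_new` would be averaged into `m_prev`.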

