LG ML May 4, 2025

Practical Efficiency of Muon for Pretraining

arXiv:2505.02222v4
51 citations
h-index: 5
Originality: Incremental advance
AI Analysis

This work addresses the problem of training large-scale machine learning models more economically. It appears incremental because it builds on existing optimization methods.

The paper demonstrates that Muon, a second-order optimizer, expands the Pareto frontier over AdamW for compute-time tradeoffs, retaining data efficiency at large batch sizes beyond the critical batch size while remaining computationally efficient. Experiments validate these findings with models up to four billion parameters.

We demonstrate that Muon, the simplest instantiation of a second-order optimizer, explicitly expands the Pareto frontier over AdamW on the compute-time tradeoff. We find that Muon is more effective than AdamW in retaining data efficiency at large batch sizes, far beyond the so-called critical batch size, while remaining computationally efficient, thus enabling more economical training. We study the combination of Muon and the maximal update parameterization (muP) for efficient hyperparameter transfer and present a simple telescoping algorithm that accounts for all sources of error in muP while introducing only a modest overhead in resources. We validate our findings through extensive experiments with model sizes up to four billion parameters and ablations on the data distribution and architecture.
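As a rough illustration of the kind of update Muon performs, the sketch below applies momentum to a weight matrix's gradient and then approximately orthogonalizes the result with a Newton-Schulz iteration. Nothing here comes from the abstract itself: the function names, the learning rate and momentum defaults, and the quintic coefficients are assumptions borrowed from the publicly available Muon reference implementation, and the paper's own setup (including its muP scaling) may differ.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately map G to the nearest semi-orthogonal matrix.

    Assumption: quintic coefficients as used in the public Muon reference
    code. They push singular values toward a band around 1 rather than
    converging to exactly 1.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    # Frobenius normalization bounds every singular value by 1.
    X = G / (np.linalg.norm(G) + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # iterate on the wide orientation so X @ X.T is small
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X

def muon_step(W, grad, momentum_buf, lr=0.02, beta=0.95):
    """One hypothetical Muon-style update for a 2D weight matrix W."""
    momentum_buf = beta * momentum_buf + grad            # heavy-ball momentum
    update = newton_schulz_orthogonalize(momentum_buf)   # orthogonalized direction
    return W - lr * update, momentum_buf

# Usage: one step on a random 16x4 weight matrix.
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 4))
grad = rng.standard_normal((16, 4))
W_new, buf = muon_step(W, grad, np.zeros_like(W))
```

Each iteration costs only a few matrix multiplications on the smaller dimension of the weight matrix, which is consistent with the abstract's claim that Muon remains computationally efficient.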

Code Implementations: 1 repo
