LG | AI | CL | Feb 24, 2025

Muon is Scalable for LLM Training

arXiv:2502.16982v1 | 251 citations | h-index: 10 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the problem of efficient and scalable training for large language models, offering a practical optimizer with open-source implementation and model releases, though it is incremental as it builds on prior Muon results.

The authors address the scalability of the Muon optimizer for large language model training by adding weight decay and per-parameter update scaling. In compute-optimal training, Muon reaches approximately 2x the computational efficiency of AdamW. Moonlight, a 3B/16B-parameter MoE model trained on 5.7T tokens with Muon, improves the Pareto frontier of performance versus training FLOPs.

Recently, the Muon optimizer based on matrix orthogonalization has demonstrated strong results in training small-scale language models, but its scalability to larger models has not been proven. We identify two crucial techniques for scaling up Muon: (1) adding weight decay and (2) carefully adjusting the per-parameter update scale. These techniques allow Muon to work out-of-the-box on large-scale training without the need for hyperparameter tuning. Scaling law experiments indicate that Muon achieves $\sim\!2\times$ computational efficiency compared to AdamW with compute-optimal training. Based on these improvements, we introduce Moonlight, a 3B/16B-parameter Mixture-of-Experts (MoE) model trained with 5.7T tokens using Muon. Our model improves the current Pareto frontier, achieving better performance with far fewer training FLOPs compared to prior models. We open-source our distributed Muon implementation, which is memory-optimal and communication-efficient. We also release the pretrained, instruction-tuned, and intermediate checkpoints to support future research.
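
The two techniques named in the abstract can be illustrated with a minimal single-matrix sketch in NumPy. It is not the released distributed implementation. The abstract does not give the concrete details, so the following are assumptions: the Newton-Schulz coefficients and step count (taken from publicly circulated Muon code), the 0.2 * sqrt(max(A, B)) update scale, plain rather than Nesterov momentum, and all default hyperparameters. The function names are hypothetical.

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5, eps=1e-7):
    """Approximately map g to a semi-orthogonal matrix (singular values near 1)."""
    # Quintic iteration coefficients: an assumption, not stated in the abstract.
    a, b, c = 3.4445, -4.7750, 2.0315
    # Frobenius normalization keeps every singular value at or below 1.
    x = g / (np.linalg.norm(g) + eps)
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T  # iterate on the wide orientation so x @ x.T is the smaller product
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x

def muon_step(w, grad, momentum_buf, lr=0.02, mu=0.95,
              weight_decay=0.1, rms_scale=0.2):
    """One illustrative Muon update for a 2-D weight matrix w."""
    # Plain heavy-ball momentum (a simplification of the real optimizer).
    momentum_buf = mu * momentum_buf + grad
    o = newton_schulz_orthogonalize(momentum_buf)
    # Per-parameter update scale: an orthogonalized A x B matrix has RMS of
    # roughly 1 / sqrt(max(A, B)), so this rescales the update RMS to about
    # rms_scale regardless of the matrix shape (assumed form).
    scale = rms_scale * np.sqrt(max(w.shape))
    # Decoupled weight decay, applied in the same way as in AdamW.
    w = w - lr * (scale * o + weight_decay * w)
    return w, momentum_buf

# Usage: one step on a random 64 x 32 weight matrix.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 32))
grad = rng.standard_normal((64, 32))
w, buf = muon_step(w, grad, np.zeros_like(w))
```

With this scaling the update magnitude no longer depends on the matrix shape, which is one plausible reading of how a single learning rate and weight decay could be shared across parameters of different sizes.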

Code Implementations: 1 repo
