Muon is Scalable for LLM Training
This work addresses efficient and scalable training of large language models. It offers a practical optimizer with an open-source implementation and model releases, though the contribution is incremental because it builds on prior Muon results.
The authors tackled the scalability of the Muon optimizer for large language model training by introducing weight decay and per-parameter update scaling, achieving approximately 2x computational efficiency compared to AdamW in compute-optimal training and improving the Pareto frontier with a 3B/16B-parameter MoE model trained on 5.7T tokens.
Recently, the Muon optimizer, which is based on matrix orthogonalization, has demonstrated strong results in training small-scale language models, but its scalability to larger models has not been proven. We identify two crucial techniques for scaling up Muon: (1) adding weight decay and (2) carefully adjusting the per-parameter update scale. These techniques allow Muon to work out of the box in large-scale training without the need for hyperparameter tuning. Scaling law experiments indicate that Muon achieves $\sim\!2\times$ computational efficiency compared to AdamW with compute-optimal training. Based on these improvements, we introduce Moonlight, a 3B/16B-parameter Mixture-of-Experts (MoE) model trained with 5.7T tokens using Muon. Our model improves the current Pareto frontier, achieving better performance with far fewer training FLOPs than prior models. We open-source our distributed Muon implementation, which is memory-optimal and communication-efficient. We also release the pretrained, instruction-tuned, and intermediate checkpoints to support future research.
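The two techniques named in the abstract can be illustrated with a minimal, single-matrix sketch. The abstract gives no formulas, so everything specific here is an assumption: the quintic Newton-Schulz coefficients follow the commonly circulated Muon reference code, the update scale is taken to be `scale * sqrt(max(rows, cols))`, plain (non-Nesterov) momentum is used for brevity, and the function names and default hyperparameters (`muon_step`, `lr`, `mu`, `weight_decay`, `scale`) are hypothetical. It is not the distributed implementation the abstract refers to.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(row) for row in zip(*a)]


def newton_schulz(g, steps=5, eps=1e-7):
    """Approximately orthogonalize g by pushing its singular values toward 1.

    The coefficients are an assumption (taken from public Muon reference
    code); with them the singular values land near 1, not exactly on it.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    norm = math.sqrt(sum(v * v for row in g for v in row)) + eps
    x = [[v / norm for v in row] for row in g]
    tall = len(x) > len(x[0])
    if tall:  # iterate on the wide orientation so x @ x.T is the smaller product
        x = transpose(x)
    for _ in range(steps):
        xxt = matmul(x, transpose(x))
        xxt2 = matmul(xxt, xxt)
        poly = [[b * p + c * q for p, q in zip(r1, r2)] for r1, r2 in zip(xxt, xxt2)]
        px = matmul(poly, x)
        x = [[a * p + q for p, q in zip(rx, rp)] for rx, rp in zip(x, px)]
    return transpose(x) if tall else x


def muon_step(w, grad, momentum, lr=0.02, mu=0.95, weight_decay=0.1, scale=0.2):
    """One Muon-style update with the two additions named in the abstract.

    (1) Weight decay: the weight_decay * w term, applied AdamW-style.
    (2) Per-parameter update scale: the orthogonalized update is multiplied
        by scale * sqrt(max(rows, cols)), an assumed shape-dependent factor.
    """
    momentum = [[mu * m + g for m, g in zip(rm, rg)] for rm, rg in zip(momentum, grad)]
    ortho = newton_schulz(momentum)
    update_scale = scale * math.sqrt(max(len(w), len(w[0])))
    w = [
        [wv - lr * (update_scale * ov + weight_decay * wv) for wv, ov in zip(rw, ro)]
        for rw, ro in zip(w, ortho)
    ]
    return w, momentum


if __name__ == "__main__":
    weights = [[1.0, 2.0, 0.5], [3.0, 4.0, -1.0]]
    grads = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.4]]
    state = [[0.0] * 3 for _ in range(2)]
    weights, state = muon_step(weights, grads, state)
    print(weights)
```

Tying the scale factor to the matrix shape is meant to keep the size of the update comparable across differently shaped weight matrices, which is one plausible reading of "carefully adjusting the per-parameter update scale". The decay term shrinks the weights independently of the gradient direction.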