Muon+: Towards Better Muon via One Additional Normalization Step
Muon+ is an incremental enhancement to the existing Muon optimizer. It may benefit researchers and practitioners who train large language models by improving training efficiency and model performance.
The paper improves the Muon optimizer for pre-training large language models by adding one normalization step after orthogonalization. This yields consistently better (lower) training and validation perplexity across models from 60M to 1B parameters and token-to-parameter ratios up to approximately 200.
The Muon optimizer has demonstrated promising performance in pre-training large language models through gradient (or momentum) orthogonalization. In this work, we propose a simple yet effective enhancement to Muon, namely Muon+, which introduces an additional normalization step after orthogonalization. We demonstrate the effectiveness of Muon+ through extensive pre-training experiments across a wide range of model scales and architectures. Our evaluation includes GPT-style models ranging from 130M to 774M parameters and LLaMA-style models ranging from 60M to 1B parameters. We evaluate Muon+ in the compute-optimal training regime and further extend the token-to-parameter (T2P) ratio to an industrial level of $\approx 200$. Experimental results show that Muon+ consistently improves training and validation perplexity over Muon. Our code is available at https://github.com/K1seki221/MuonPlus.
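To make the idea concrete, the following is a minimal sketch of "orthogonalize, then normalize" in plain Python, not the authors' implementation. The orthogonalization uses the quintic Newton-Schulz iteration commonly found in public Muon code. The abstract does not say which normalization Muon+ applies, so the row-wise L2 normalization here is an assumption made purely for illustration. The names `orthogonalize`, `normalize_rows`, and `muon_plus_direction` are hypothetical.

```python
import math

# Quintic Newton-Schulz coefficients as used in widely circulated Muon code.
NS_COEFFS = (3.4445, -4.7750, 2.0315)


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def frobenius_norm(a):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(x * x for row in a for x in row))


def orthogonalize(g, steps=5, eps=1e-7):
    """Approximately orthogonalize g with Newton-Schulz iterations."""
    a, b, c = NS_COEFFS
    tall = len(g) > len(g[0])
    x = transpose(g) if tall else g  # iterate on the wide orientation
    scale = frobenius_norm(x) + eps  # bring all singular values to at most 1
    x = [[v / scale for v in row] for row in x]
    for _ in range(steps):
        gram = matmul(x, transpose(x))  # X X^T
        gram2 = matmul(gram, gram)      # (X X^T)^2
        poly = [[b * p + c * q for p, q in zip(r1, r2)]
                for r1, r2 in zip(gram, gram2)]
        px = matmul(poly, x)
        # X <- a X + (b X X^T + c (X X^T)^2) X
        x = [[a * u + v for u, v in zip(r1, r2)] for r1, r2 in zip(x, px)]
    return transpose(x) if tall else x


def normalize_rows(x, eps=1e-7):
    """ASSUMPTION: row-wise L2 normalization; the source does not specify the norm."""
    out = []
    for row in x:
        n = math.sqrt(sum(v * v for v in row)) + eps
        out.append([v / n for v in row])
    return out


def muon_plus_direction(momentum):
    """Hypothetical update direction: orthogonalize, then normalize once more."""
    return normalize_rows(orthogonalize(momentum))


if __name__ == "__main__":
    momentum = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]]
    for row in muon_plus_direction(momentum):
        print(row)
```

In a real training loop the direction would be scaled by a learning rate and applied to a weight matrix, and a tensor library would replace the list-based helpers. Nested lists are used here only to keep the sketch dependency-free.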