MARS-M: When Variance Reduction Meets Matrices
This work addresses efficient optimization for large-scale neural network training and is aimed at researchers and practitioners working with large language models and computer vision tasks. It offers an incremental improvement over existing methods.
The authors introduce MARS-M, an optimizer that integrates variance reduction with matrix-based preconditioning. It achieves a convergence rate of $\tilde{\mathcal{O}}(T^{-1/3})$ and consistently yields lower losses and improved performance on downstream benchmarks across language modeling and computer vision tasks.
Matrix-based preconditioned optimizers, such as Muon, have recently been shown to be more efficient than scalar-based optimizers for training large-scale neural networks, including large language models (LLMs). On the other hand, recent benchmarks on optimizers for LLM pre-training have demonstrated that variance-reduction techniques such as MARS can achieve substantial speedups over standard optimizers that do not employ variance reduction. In this paper, to achieve the best of both worlds, we introduce MARS-M, a new optimizer that integrates the variance reduction technique in MARS with Muon. Under standard regularity conditions, we prove that MARS-M converges to a first-order stationary point at a rate of $\tilde{\mathcal{O}}(T^{-1/3})$, which improves upon the $\tilde{\mathcal{O}}(T^{-1/4})$ rate attained by Muon. Our empirical results on language modeling and computer vision tasks demonstrate that MARS-M consistently yields lower losses and improved performance across various downstream benchmarks. The implementation of MARS-M is available at https://github.com/AGI-Arena/MARS/tree/main/MARS_M.
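To make the combination concrete, the following is a minimal NumPy sketch of how a MARS-style variance-reduced gradient might be fed into a Muon-style orthogonalized update for a single weight matrix. It is an illustration under stated assumptions, not the released implementation: the function names `newton_schulz` and `mars_m_step` are hypothetical, the hyperparameter defaults are illustrative, the previous step's gradient stands in for re-evaluating the gradient at the previous iterate on the current batch, and the `0.2 * sqrt(max(m, n))` update scaling is assumed. The Newton-Schulz coefficients are those commonly used in public Muon implementations.

```python
import numpy as np


def newton_schulz(M, steps=5, eps=1e-7):
    """Approximately orthogonalize M with a quintic Newton-Schulz iteration.

    Coefficients follow common public Muon implementations (an assumption here).
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    # Normalize so all singular values are at most 1 before iterating.
    X = M / (np.linalg.norm(M) + eps)
    # Work on the wide orientation so X @ X.T is the smaller Gram matrix.
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X


def mars_m_step(X, M, grad, prev_grad, lr=0.01, beta=0.95, gamma=0.025,
                weight_decay=0.0):
    """One hypothetical MARS-M-style update for a weight matrix X.

    M is the momentum buffer; prev_grad approximates the gradient at the
    previous iterate (an assumption of this sketch).
    """
    # Variance-reduced gradient: current gradient plus a scaled correction.
    C = grad + gamma * (beta / (1.0 - beta)) * (grad - prev_grad)
    # Clip the corrected gradient to unit Frobenius norm.
    norm = np.linalg.norm(C)
    if norm > 1.0:
        C = C / norm
    # Exponential moving average of the corrected gradient.
    M = beta * M + (1.0 - beta) * C
    # Matrix preconditioning via approximate orthogonalization.
    O = newton_schulz(M)
    # Assumed shape-dependent scaling of the orthogonalized update.
    scale = 0.2 * np.sqrt(max(X.shape))
    X = X - lr * (scale * O + weight_decay * X)
    return X, M
```

In a training loop, `mars_m_step` would be called once per matrix-shaped parameter per iteration, carrying `M` and the previous gradient between calls. Setting `gamma=0` removes the correction term, so the step reduces to a Muon-style momentum update (apart from the clipping).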