LG, OC, ML | Oct 20, 2025

MARS-M: When Variance Reduction Meets Matrices

arXiv:2510.21800v2 | 7 citations | h-index: 3 | Has Code
Originality: Highly original
AI Analysis

This work addresses the problem of efficient optimization for large-scale neural network training, particularly for researchers and practitioners working with large language models and computer vision tasks, providing an incremental improvement over existing methods.

The authors tackled the problem of optimizing large-scale neural networks by introducing MARS-M, which integrates variance reduction with matrix-based preconditioning, achieving a convergence rate of $\tilde{\mathcal{O}}(T^{-1/3})$ and improving performance on language modeling and computer vision tasks. MARS-M consistently yields lower losses and improved performance across various downstream benchmarks.
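The combination of variance reduction and matrix-based preconditioning described above can be sketched roughly as follows. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: it assumes a MARS-style corrected gradient (the current gradient plus a scaled difference between gradients at consecutive iterates, evaluated on the same batch), norm clipping, momentum, and a Muon-style Newton-Schulz orthogonalization of the momentum matrix. The function names, hyperparameter values, and the toy quadratic objective are hypothetical; the Newton-Schulz coefficients are the ones commonly seen in public Muon code.

```python
import numpy as np


def newton_schulz(M, steps=5, eps=1e-7):
    """Approximately orthogonalize M by pushing its singular values toward 1."""
    # Quintic iteration coefficients commonly seen in public Muon code (assumption).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = M / (np.linalg.norm(M) + eps)  # Frobenius normalization bounds the spectral norm by 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:  # iterate on the wide orientation so X @ X.T is the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X


def mars_m_step(X, X_prev, M, grad_fn, batch, lr=0.02, beta=0.9, gamma=0.025):
    """One illustrative update; all hyperparameter defaults are hypothetical."""
    g = grad_fn(X, batch)
    g_prev = grad_fn(X_prev, batch)  # same batch, previous iterate
    # Variance-reduced gradient: current gradient plus a scaled gradient difference.
    c = g + gamma * (beta / (1.0 - beta)) * (g - g_prev)
    norm = np.linalg.norm(c)
    if norm > 1.0:  # clip to unit Frobenius norm (assumed, following MARS-style clipping)
        c = c / norm
    M = beta * M + (1.0 - beta) * c   # momentum on the corrected gradient
    O = newton_schulz(M)              # matrix preconditioning via orthogonalization
    return X - lr * O, M


def run_toy_demo(steps=600, seed=0):
    """Minimize 0.5 * ||X - X_star||_F^2 using noisy gradients (toy problem)."""
    rng = np.random.default_rng(seed)
    X_star = rng.standard_normal((4, 6))

    def grad_fn(X, batch):
        # `batch` is a noise matrix standing in for the randomness of a minibatch.
        return X - X_star - batch

    def loss(X):
        return 0.5 * np.linalg.norm(X - X_star) ** 2

    X = np.zeros((4, 6))
    X_prev = X.copy()
    M = np.zeros((4, 6))
    initial = loss(X)
    for _ in range(steps):
        batch = 0.1 * rng.standard_normal((4, 6))
        X_next, M = mars_m_step(X, X_prev, M, grad_fn, batch)
        X_prev, X = X, X_next
    return initial, loss(X)


if __name__ == "__main__":
    initial, final = run_toy_demo()
    print(f"loss: {initial:.3f} -> {final:.3f}")
```

Evaluating both gradients on the same batch is what lets the correction term cancel shared sampling noise. A cheaper variant could reuse the stored gradient from the previous step instead, at the cost of mixing two different batches.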

Matrix-based preconditioned optimizers, such as Muon, have recently been shown to be more efficient than scalar-based optimizers for training large-scale neural networks, including large language models (LLMs). On the other hand, recent benchmarks on optimizers for LLM pre-training have demonstrated that variance-reduction techniques such as MARS can achieve substantial speedups over standard optimizers that do not employ variance reduction. In this paper, to achieve the best of both worlds, we introduce MARS-M, a new optimizer that integrates the variance reduction technique in MARS with Muon. Under standard regularity conditions, we prove that MARS-M converges to a first-order stationary point at a rate of $\tilde{\mathcal{O}}(T^{-1/3})$, which improves upon the $\tilde{\mathcal{O}}(T^{-1/4})$ rate attained by Muon. Our empirical results on language modeling and computer vision tasks demonstrate that MARS-M consistently yields lower losses and improved performance across various downstream benchmarks. The implementation of MARS-M is available at https://github.com/AGI-Arena/MARS/tree/main/MARS_M.
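To see what the gap between the two rates means for iteration counts, suppose (as a simplifying assumption, ignoring logarithmic factors and using one hypothetical constant $C$ for both methods) that the stationarity measure after $T$ steps is bounded by $C\,T^{-1/3}$ for MARS-M and by $C\,T^{-1/4}$ for Muon. Reaching a target accuracy $\epsilon$ then requires

$$
C\,T^{-1/3} \le \epsilon \iff T \ge \left(\frac{C}{\epsilon}\right)^{3},
\qquad
C\,T^{-1/4} \le \epsilon \iff T \ge \left(\frac{C}{\epsilon}\right)^{4}.
$$

With $C = 1$ and $\epsilon = 0.1$, for instance, the bounds give $T \ge 10^{3}$ versus $T \ge 10^{4}$, a gap that grows by a factor of $1/\epsilon$ as the target tightens. The constants in the two analyses need not coincide, so this illustrates only the scaling, not a measured speedup.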

