LG · Sep 27, 2025

Effective Quantization of Muon Optimizer States

Aman Gupta, Rafael Celente, Abhishek Shivanna, D. T. Braithwaite, Gregory Dexter, Shao Tang, Hiroto Udagawa, Daniel Silva, Rohan Ramanath, S. Sathiya Keerthi

arXiv:2509.23106v1 · 14.44 citations · h-index: 11

Originality: Incremental advance

AI Analysis

This work addresses memory efficiency for large-scale model training, particularly for LLMs, but is incremental as it adapts existing quantization techniques to a new optimizer.

The paper tackles the memory overhead of the Muon optimizer by introducing an 8-bit quantized version using blockwise quantization, achieving a ~74% reduction in memory footprint while maintaining performance comparable to full-precision Muon in pre-training and fine-tuning experiments.

The Muon optimizer, based on matrix orthogonalization, has recently shown faster convergence and up to 2x computational efficiency over AdamW in LLM pretraining. Like AdamW, Muon is stateful, requiring storage of both model weights and accumulated gradients. While 8-bit AdamW variants mitigate this overhead using blockwise quantization, they are typically stable only under dynamic quantization, which improves stability on linear quantization for extreme values. In this paper, we introduce the 8-bit Muon optimizer using blockwise quantization, supporting both linear and dynamic schemes. We demonstrate that 8-bit Muon maintains stability under both, while delivering ~74% reduction in memory footprint compared to full-precision Muon. In extensive experiments, 8-bit Muon closely matches the performance of Muon while outperforming AdamW and 8-bit AdamW in pre-training a 1.6B model on 4B FineWeb tokens. It also shows competitive results when fine-tuning the Llama 3.2 3B model on post-training data. We also provide a theoretical perspective to help explain this robustness under quantization.
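To make the term concrete, here is a minimal sketch of blockwise linear (absmax) 8-bit quantization applied to a flat list of optimizer-state values. It is not the paper's implementation: the function names, the block size of 4 (chosen only to keep the example small), and the sample values are all assumptions for illustration, and the dynamic scheme, which uses a non-uniform code map, is not shown.

```python
def quantize_blockwise(values, block_size=4):
    """Linear absmax quantization to signed 8-bit codes, one scale per block.

    Hypothetical helper for illustration; not taken from the paper.
    """
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block)
        # One float scale per block maps the block's range onto [-127, 127].
        scale = absmax / 127.0 if absmax > 0 else 1.0
        scales.append(scale)
        for v in block:
            code = int(round(v / scale))
            codes.append(max(-127, min(127, code)))  # clamp against float drift
    return codes, scales


def dequantize_blockwise(codes, scales, block_size=4):
    """Recover approximate values by rescaling each code with its block's scale."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


# Assumed sample state: a block of small values followed by a block of large ones.
momentum = [0.01, -0.02, 0.005, 0.015, 3.0, -1.5, 0.75, 2.25]
codes, scales = quantize_blockwise(momentum, block_size=4)
restored = dequantize_blockwise(codes, scales, block_size=4)

for original, approx in zip(momentum, restored):
    print(f"{original:+.4f} -> {approx:+.4f}")
```

Because each block carries its own scale, the small values in the first block are not swamped by the large values in the second. With a single global scale of 3.0/127, the entry 0.005 would round to code 0 and be lost entirely. The per-block scale is also the source of the small storage overhead on top of the one byte per value.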
