LG · May 29

How Much Orthogonalization Does Muon Need?

arXiv:2606.00371 · 72.4

AI Analysis

This work provides a practical, low-cost orthogonalization variant for Muon optimizers, showing that expensive high-accuracy polar decomposition is unnecessary for training quality in the tested settings.

The authors investigate how much orthogonalization Muon optimizers need, proposing a five-step cubic Newton-Schulz schedule that uses fewer matrix multiplications than the standard five quintic iterations (ten versus fifteen). They find that training quality is not monotonically tied to polar-decomposition accuracy: cubic5 matches Muon-Jordan to within about $10^{-3}$ in validation loss on hybrid MoE/Mamba models with 1B to 4B parameters.

Muon optimizers improve neural-network training by replacing ill-conditioned momentum updates with approximately semi-orthogonal updates. This motivates a practical question: how much orthogonalization does Muon actually require? We study this question using a relaxed cubic Newton-Schulz schedule derived directly for Muon's low-precision singular-value band. The resulting five-step cubic construction uses ten dominant matrix multiplications, compared with fifteen for five quintic Newton-Schulz iterations. The cubic schedule is not intended as a more accurate polar solver; instead, it is a principled low-cost variant that lets us probe the relation between polar accuracy, spectral shaping, and training quality. Across synthetic diagnostics, NanoGPT ablations, and training experiments on hybrid MoE/Mamba models, we find that training quality is not governed monotonically by polar-decomposition accuracy: truncated Polar Express, Muon-Jordan, cubic Newton-Schulz, and an explicit FP32 SVD polar factor can reach nearly indistinguishable final loss on GPT-2 Small, and cubic5 matches the Muon-Jordan quintic update to within about $10^{-3}$ in validation loss on hybrid MoE/Mamba models with one billion to four billion parameters. These results support cubic5 as a practical low-cost Muon orthogonalization variant, with empirical evidence of training-quality parity in the settings tested.
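
The ten-versus-fifteen multiplication count can be checked with a small sketch. It is an illustration under stated assumptions, not the paper's method: the cubic step uses the classical Newton-Schulz coefficients (1.5, -0.5) rather than the relaxed schedule derived in the paper (whose coefficients are not given here), and the quintic step uses the coefficients (3.4445, -4.7750, 2.0315) commonly associated with the Muon-Jordan update. The helper names (`matmul`, `lincomb`, `run`) are hypothetical, and plain Python lists stand in for tensors.

```python
import math

matmul_count = 0  # counts the dominant matrix products

def matmul(A, B):
    """Plain-Python matrix product that also increments the counter."""
    global matmul_count
    matmul_count += 1
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def lincomb(alpha, A, beta, B):
    """Elementwise alpha*A + beta*B."""
    return [[alpha * x + beta * y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

def normalize(G):
    """Divide by the Frobenius norm so every singular value is at most 1."""
    fro = math.sqrt(sum(x * x for row in G for x in row))
    return [[x / fro for x in row] for row in G]

def cubic_step(X, a=1.5, b=-0.5):
    # X <- a X + b (X X^T) X : two matrix products.
    # Classical coefficients (assumption), not the paper's relaxed schedule.
    A = matmul(X, transpose(X))
    return lincomb(a, X, b, matmul(A, X))

def quintic_step(X, a=3.4445, b=-4.7750, c=2.0315):
    # X <- a X + (b A + c A^2) X with A = X X^T : three matrix products.
    A = matmul(X, transpose(X))
    B = lincomb(b, A, c, matmul(A, A))
    return lincomb(a, X, 1.0, matmul(B, X))

def run(step, G, steps=5):
    """Apply `steps` iterations; return the result and the product count."""
    global matmul_count
    matmul_count = 0
    X = normalize(G)
    for _ in range(steps):
        X = step(X)
    return X, matmul_count

G = [[3.0, 1.0, 0.0], [1.0, 2.0, 1.0]]  # toy stand-in for a momentum matrix
_, n_cubic = run(cubic_step, G)
_, n_quintic = run(quintic_step, G)
print(n_cubic, n_quintic)  # 10 15
```

Each cubic step needs two products and each quintic step three, so five steps give ten and fifteen respectively, matching the counts in the abstract. The sketch says nothing about training quality; it only makes the cost comparison concrete.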

