LG · May 6, 2025

Iterative Orthogonalization Scaling Laws

arXiv:2505.04005v2
Originality: Synthesis-oriented
AI Analysis

This is an incremental analysis of a potential bottleneck in an emerging optimizer, relevant for researchers working on large-scale optimization.

This paper identifies a scaling issue in the muon optimizer's iterative orthogonalization procedure, where singular values of random matrices shrink at larger scales, and demonstrates this behavior both theoretically and empirically.

The muon optimizer has recently attracted considerable attention as a possible replacement for the seemingly omnipresent Adam optimizer. Recent work has documented the scaling laws of hyper-parameters under muon, such as weight decay and learning rate. At much larger scales, however, the iterative orthogonalization procedure in muon may run into trouble, because the singular values of random matrices shrink with scale. This paper demonstrates this scaling behavior theoretically and empirically on random matrices, but does not propose a remedy.
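The shrinkage described above can be illustrated with a small numerical sketch. It is not taken from the paper and rests on several assumptions: square matrices with i.i.d. Gaussian entries, normalization by the Frobenius norm before iterating, and the classical cubic Newton-Schulz update X ← 1.5X − 0.5XXᵀX standing in for whatever polynomial muon actually uses. The function names are hypothetical.

```python
import numpy as np


def normalized_singular_values(n, seed=0):
    """Singular values of an n x n Gaussian matrix scaled to unit Frobenius norm.

    Assumption: i.i.d. standard normal entries stand in for a "random matrix".
    """
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((n, n))
    x = g / np.linalg.norm(g)  # default norm for a 2-D array is Frobenius
    return np.linalg.svd(x, compute_uv=False)


def newton_schulz_cubic(x, steps=5):
    """Classical cubic Newton-Schulz iteration (an illustrative stand-in).

    Each step maps every singular value s to 1.5*s - 0.5*s**3, which pushes
    values in (0, 1] toward 1 while leaving the singular vectors unchanged.
    """
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x
    return x


def median_after_orthogonalization(n, steps=5, seed=0):
    """Median singular value after a fixed number of iteration steps."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((n, n))
    x = newton_schulz_cubic(g / np.linalg.norm(g), steps)
    return float(np.median(np.linalg.svd(x, compute_uv=False)))


if __name__ == "__main__":
    for n in (32, 128, 512):
        before = float(np.median(normalized_singular_values(n)))
        after = median_after_orthogonalization(n)
        print(f"n={n:4d}  median before={before:.4f}  median after 5 steps={after:.4f}")
```

Under these assumptions the squared singular values sum to one, so a typical singular value scales roughly like 1/√n. With the step count held fixed, larger matrices therefore end the iteration further from orthogonal, which is the kind of bottleneck the abstract describes.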

