LG May 6, 2025

Iterative Orthogonalization Scaling Laws

arXiv:2505.04005v2

Originality: Synthesis-oriented

AI Analysis

This is an incremental analysis of a potential bottleneck in an emerging optimizer, relevant for researchers working on large-scale optimization.

This paper identifies a scaling issue in the muon optimizer's iterative orthogonalization procedure, where singular values of random matrices shrink at larger scales, and demonstrates this behavior both theoretically and empirically.

The muon optimizer has recently attracted considerable attention as a possible replacement for the seemingly omnipresent Adam optimizer. Recent work has documented the scaling laws of muon's hyper-parameters, such as weight decay and learning rate. However, at much larger scales, the iterative orthogonalization procedure in muon may run into a problem: the singular values of random matrices shrink with scale. This paper demonstrates this scaling behavior theoretically and empirically on random matrices but does not propose a remedy.
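The shrinkage described above can be illustrated with a minimal sketch. It is not the paper's experiment: it assumes square Gaussian random matrices, Frobenius-norm scaling before iterating, and the classic cubic Newton-Schulz update, none of which the abstract specifies (muon implementations may use a different tuned polynomial). The function names are hypothetical.

```python
import numpy as np


def normalized_singular_values(n, seed=0):
    """Singular values of an n x n Gaussian matrix scaled to unit Frobenius norm.

    Assumption: Frobenius normalization is the pre-scaling step.
    """
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((n, n))
    x = g / np.linalg.norm(g)  # default norm for a 2-D array is Frobenius
    return np.linalg.svd(x, compute_uv=False)


def newton_schulz_cubic(x, steps=5):
    """Classic cubic Newton-Schulz iteration: X <- 1.5 X - 0.5 X X^T X.

    Assumption: used for illustration only; it maps each singular value s
    to 1.5 s - 0.5 s^3, so small values grow by at most 1.5x per step.
    """
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x


def singular_values_after_iteration(n, steps=5, seed=0):
    """Singular values after a fixed number of iteration steps."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((n, n))
    x = newton_schulz_cubic(g / np.linalg.norm(g), steps=steps)
    return np.linalg.svd(x, compute_uv=False)


if __name__ == "__main__":
    for n in (32, 128, 512):
        before = normalized_singular_values(n)
        after = singular_values_after_iteration(n)
        print(f"n={n:4d}  max before={before.max():.4f}  "
              f"median after 5 steps={np.median(after):.4f}")
```

With a fixed number of steps, the larger matrices start from smaller singular values and therefore end further from 1, which is the kind of scale dependence the abstract points to.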

