LG ML Mar 30

MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration

arXiv:2603.28254 80.04 citations h-index: 2
AI Analysis

This work addresses optimization challenges in training large language models. Its contribution is incremental: it extends the existing Muon optimizer with targeted enhancements rather than proposing a new method.

The paper improves orthogonalized-update optimizers such as Muon by introducing lightweight pre-orthogonalization equilibration schemes. In LLaMA2 pretraining on C4 with 130M and 350M models, these schemes yield faster convergence and lower validation perplexity.

Orthogonalized-update optimizers such as Muon improve training of matrix-valued parameters, but existing extensions mostly act either after orthogonalization by rescaling updates or before it with heavier whitening-based preconditioners. We introduce MuonEq, a lightweight family of pre-orthogonalization equilibration schemes for Muon in three forms: two-sided row/column normalization (RC), row normalization (R), and column normalization (C). These variants rebalance the momentum matrix before finite-step Newton-Schulz using row/column squared-norm statistics and only $\mathcal{O}(m+n)$ auxiliary state. We show that finite-step orthogonalization is governed by input spectral properties, especially stable rank and condition number, and that row/column normalization is a zeroth-order whitening surrogate that removes marginal scale mismatch. For the hidden matrix weights targeted by MuonEq, the row-normalized variant R is the natural default and preserves the $\widetilde{\mathcal{O}}(T^{-1/4})$ stationarity guarantee of Muon-type methods. In LLaMA2 pretraining on C4, the default R variant consistently outperforms Muon on 130M and 350M models, yielding faster convergence and lower validation perplexity.
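
To make the idea of equilibrating before orthogonalization concrete, here is a minimal NumPy sketch. It is not the paper's implementation, and it rests on several assumptions. The normalization uses instantaneous row or column norms rather than whatever running squared-norm statistics the method actually maintains. The helper names (`row_normalize`, `col_normalize`, `newton_schulz`, `orth_error`) are hypothetical. The orthogonalization step is the classical cubic Newton-Schulz iteration, swapped in for simplicity, not the tuned polynomial used in Muon implementations. The two-sided RC variant is omitted because the abstract does not specify its exact form.

```python
import numpy as np

def row_normalize(M, eps=1e-8):
    """Scale each row of M to (approximately) unit Euclidean norm.
    Assumed stand-in for the R variant; needs only m row statistics."""
    row_sq = np.sum(M * M, axis=1, keepdims=True)  # squared row norms, shape (m, 1)
    return M / np.sqrt(row_sq + eps)

def col_normalize(M, eps=1e-8):
    """Scale each column of M to (approximately) unit Euclidean norm.
    Assumed stand-in for the C variant; needs only n column statistics."""
    col_sq = np.sum(M * M, axis=0, keepdims=True)  # squared column norms, shape (1, n)
    return M / np.sqrt(col_sq + eps)

def newton_schulz(M, steps=5):
    """Finite-step classical cubic Newton-Schulz iteration.
    Dividing by the Frobenius norm keeps all singular values in (0, 1],
    where the map x -> 1.5x - 0.5x^3 pushes them toward 1."""
    X = M / (np.linalg.norm(M) + 1e-12)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X

def orth_error(X):
    """Frobenius distance of X X^T from the identity (0 for orthonormal rows)."""
    return np.linalg.norm(X @ X.T - np.eye(X.shape[0]))

# Synthetic "momentum" matrix whose rows differ in scale by three orders of magnitude.
rng = np.random.default_rng(0)
M = np.logspace(0, 3, 8)[:, None] * rng.standard_normal((8, 16))
M_eq = row_normalize(M)

cond_raw, cond_eq = np.linalg.cond(M), np.linalg.cond(M_eq)
err_raw = orth_error(newton_schulz(M))
err_eq = orth_error(newton_schulz(M_eq))
print(f"condition number: raw={cond_raw:.1f}, row-normalized={cond_eq:.1f}")
print(f"orthogonality error after 5 steps: raw={err_raw:.3f}, row-normalized={err_eq:.3f}")
```

On this toy input, row normalization removes the row-scale mismatch, which lowers the condition number and lets the same five Newton-Schulz steps land closer to an orthogonal matrix. This mirrors the abstract's claim that finite-step orthogonalization is governed by the spectral properties of its input, though the sketch says nothing about training outcomes.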

