LG · Feb 6

Uniform Spectral Growth and Convergence of Muon in LoRA-Style Matrix Factorization

Changmin Kang, Jihun Yun, Baekrok Shin, Yeseul Cho, Chulhee Yun

arXiv:2602.06385v1 · 3.85 citations · h-index: 3

Originality: Incremental advance

AI Analysis

This work provides theoretical insight into optimization for low-rank adaptation (LoRA) in LLMs. The advance is incremental, but it addresses a known gap in the understanding of practical optimizers such as Muon.

The paper studies the dynamics of spectral gradient descent in LoRA fine-tuning of large language models. It finds that the singular values of the LoRA product grow nearly uniformly, so smaller singular values reach their targets earlier than larger ones. It also proves convergence to global minima from almost all initializations when the factor norms remain bounded, and global convergence when $\ell_2$ regularization is added.

Spectral gradient descent (SpecGD) orthogonalizes the matrix parameter updates and has inspired practical optimizers such as Muon. Such optimizers often perform well in large language model (LLM) training, but their dynamics remain poorly understood. In the low-rank adaptation (LoRA) setting, where weight updates are parameterized as a product of two low-rank factors, we find a distinctive spectral phenomenon under Muon in LoRA fine-tuning of LLMs: singular values of the LoRA product show near-uniform growth across the spectrum, despite orthogonalization being performed on the two factors separately. Motivated by this observation, we analyze spectral gradient flow (SpecGF), a continuous-time analogue of SpecGD, in a simplified LoRA-style matrix factorization setting and prove "equal-rate" dynamics: all singular values grow at equal rates up to small deviations. Consequently, smaller singular values attain their target values earlier than larger ones, sharply contrasting with the largest-first stepwise learning observed in standard gradient flow. Moreover, we prove that SpecGF in our setting converges to global minima from almost all initializations, provided the factor norms remain bounded; with $\ell_2$ regularization, we obtain global convergence. Lastly, we corroborate our theory with experiments in the same setting.
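As a rough illustration of the setting the abstract describes, the sketch below runs a discrete SpecGD-style update on a two-factor matrix factorization and records the singular values of the product. It is not the authors' code, and several pieces are assumptions made only for this example: the loss $\frac{1}{2}\|BA - M\|_F^2$, the dimensions, step size, initialization scale, and seed, and the helper names `orthogonalize`, `specgd_step`, and `run`. Each factor's gradient is orthogonalized separately with an exact SVD, whereas Muon approximates this step with a Newton-Schulz iteration and adds momentum, both of which are omitted here.

```python
import numpy as np


def orthogonalize(G):
    """Return the polar factor U V^T of G (nonzero singular values set to 1)."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt


def specgd_step(B, A, M, lr):
    """One SpecGD-style step on 0.5 * ||B A - M||_F^2 (assumed loss).

    The gradient of each factor is orthogonalized separately.
    """
    R = B @ A - M          # residual
    grad_B = R @ A.T       # gradient with respect to B
    grad_A = B.T @ R       # gradient with respect to A
    return B - lr * orthogonalize(grad_B), A - lr * orthogonalize(grad_A)


def run(d=8, r=3, steps=300, lr=0.01, init_scale=1e-3, seed=0):
    """Track the top-r singular values of B A during training.

    All hyperparameters here are illustrative choices, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    # Rank-r target with singular values 1, 2, ..., r.
    U, _ = np.linalg.qr(rng.standard_normal((d, r)))
    V, _ = np.linalg.qr(rng.standard_normal((d, r)))
    target_svals = np.arange(1, r + 1, dtype=float)
    M = U @ np.diag(target_svals) @ V.T
    # Small random initialization of both factors.
    B = init_scale * rng.standard_normal((d, r))
    A = init_scale * rng.standard_normal((r, d))
    history = []
    for _ in range(steps + 1):
        # Singular values come back in descending order; keep the top r.
        history.append(np.linalg.svd(B @ A, compute_uv=False)[:r])
        B, A = specgd_step(B, A, M, lr)
    return np.array(history), target_svals


if __name__ == "__main__":
    history, target = run()
    print("target singular values:", target)
    for k in (0, 50, 100, 150, 200, 300):
        print(k, np.round(history[k], 3))
```

Printing the recorded singular values at a few checkpoints makes it possible to check whether they grow at similar rates in this toy discretization, and whether the smallest target is reached first, as the equal-rate result for SpecGF would suggest. With a fixed step size the iterates can oscillate around the targets rather than settle exactly, so a decaying step size may be a reasonable variation to try.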
