LG · CL · May 8

OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling

arXiv:2605.07815

Predicted impact: top 41% in LG · last 90 days
Originality: Incremental advance

AI Analysis

For practitioners training large neural networks, OrScale provides a layer-adaptive optimizer that improves on Muon and AdamW, with gains reported on both vision and language tasks.

OrScale extends Muon with a layer-wise trust ratio whose denominator is the Frobenius norm of the update direction actually applied in parameter space. It reaches 94.05% validation top-1 accuracy on CIFAR-10/DavidNet (vs. 93.70% for Muon). On FineWeb-Edu pre-training, the language-model variant OrScale-LM outperforms Muon+Moonlight at three of four scales (125M to 1.1B parameters) and AdamW at every scale.

Muon improves neural-network training by orthogonalizing matrix-valued updates, but it leaves each layer's update magnitude controlled mostly by a global learning rate. We introduce OrScale, a trust-ratio extension of Muon built on a simple rule: the denominator of a layer-wise ratio should measure the Frobenius norm of the actual parameter-space direction that will be applied. This yields OrScale for general matrix layers and OrScale-LM for language models, where Moonlight shape scaling is combined with one-time per-layer calibration so every trust ratio starts at one. We analyze why three natural Muon-LAMB hybrids fail through shape-degenerate denominators, raw-momentum clip saturation, and decoupled weight-decay runaway, and show that the real-update-direction denominator with coupled weight decay avoids these failures. Theoretically, OrScale admits an O(1/sqrt(T)) nonconvex convergence guarantee in a nuclear-norm criterion, a strict layer-adaptive descent gain under measurable layer heterogeneity, and calibration properties that preserve muP-style learning-rate transfer at initialization. Empirically, OrScale ranks first on CIFAR-10/DavidNet across three seeds, improving Muon from 93.70% to 94.05% validation top-1, and OrScale-LM improves FineWeb-Edu pre-training versus Muon+Moonlight at three of four scales from 125M to 1.1B parameters while outperforming AdamW at every scale.
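
The denominator rule in the abstract can be made concrete with a minimal NumPy sketch. This is an illustration under stated assumptions, not the paper's algorithm: an SVD stands in for Muon's Newton-Schulz orthogonalization, the numerator is assumed to be the LAMB-style weight norm, coupled weight decay is assumed to mean adding `weight_decay * w` to the direction before the ratio is computed, calibration is assumed to be division by the ratio measured once at initialization, and Moonlight shape scaling is omitted. The function names and hyperparameter values are hypothetical.

```python
import numpy as np


def orthogonalize(m):
    """Semi-orthogonal factor U @ Vt of m via SVD.

    Assumption: SVD is used here for clarity; Muon itself approximates
    this factor with a Newton-Schulz iteration.
    """
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt


def applied_direction(w, momentum, weight_decay):
    # Assumed form of the "actual parameter-space direction":
    # orthogonalized momentum plus a coupled weight-decay term.
    return orthogonalize(momentum) + weight_decay * w


def raw_trust_ratio(w, momentum, weight_decay, eps=1e-12):
    # LAMB-style ratio (assumed numerator): ||W||_F / ||direction||_F.
    # np.linalg.norm on a 2-D array defaults to the Frobenius norm.
    d = applied_direction(w, momentum, weight_decay)
    return np.linalg.norm(w) / (np.linalg.norm(d) + eps)


def trust_ratio_step(w, momentum, lr=0.02, weight_decay=0.01, calib=1.0):
    """One hypothetical layer update; returns (new_w, effective_ratio).

    calib is the raw ratio measured once at initialization, so the
    effective ratio starts at one (assumed reading of the calibration).
    """
    d = applied_direction(w, momentum, weight_decay)
    ratio = raw_trust_ratio(w, momentum, weight_decay) / calib
    return w - lr * ratio * d, ratio


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((4, 6))
    momentum = rng.standard_normal((4, 6))

    # A full-rank semi-orthogonal m x n matrix has Frobenius norm
    # sqrt(min(m, n)), which depends only on the layer shape.
    print("||orth(M)||_F:", np.linalg.norm(orthogonalize(momentum)))

    calib = raw_trust_ratio(w, momentum, weight_decay=0.01)
    w_new, ratio = trust_ratio_step(w, momentum, calib=calib)
    print("effective ratio at init:", ratio)
```

The printed norm suggests one plausible reading of the "shape-degenerate denominators" failure: if the ratio were divided by the norm of the orthogonalized momentum alone, the denominator would carry no information beyond the layer's shape. Including the coupled decay term makes the denominator depend on the weights as well.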
