LG · CL · May 8

OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling

arXiv:2605.07815 · 57.1
Predicted impact: top 41% in LG · last 90 days
Originality: Incremental advance
AI Analysis

For practitioners training large neural networks, OrScale provides a principled layer-adaptive optimizer that improves on Muon in most tested settings and on AdamW at every language-model scale, with gains on both vision and language tasks.

OrScale extends Muon with a layer-wise trust ratio whose denominator is the Frobenius norm of the parameter-space direction that is actually applied. It reaches 94.05% validation top-1 accuracy on CIFAR-10/DavidNet (vs. Muon's 93.70%). On FineWeb-Edu pre-training, OrScale-LM outperforms Muon+Moonlight at three of four scales (125M to 1.1B parameters) and AdamW at all four.

Muon improves neural-network training by orthogonalizing matrix-valued updates, but it leaves each layer's update magnitude controlled mostly by a global learning rate. We introduce OrScale, a trust-ratio extension of Muon built on a simple rule: the denominator of a layer-wise ratio should measure the Frobenius norm of the actual parameter-space direction that will be applied. This yields OrScale for general matrix layers and OrScale-LM for language models, where Moonlight shape scaling is combined with one-time per-layer calibration so every trust ratio starts at one. We analyze why three natural Muon-LAMB hybrids fail through shape-degenerate denominators, raw-momentum clip saturation, and decoupled weight-decay runaway, and show that the real-update-direction denominator with coupled weight decay avoids these failures. Theoretically, OrScale admits an O(1/sqrt(T)) nonconvex convergence guarantee in a nuclear-norm criterion, a strict layer-adaptive descent gain under measurable layer heterogeneity, and calibration properties that preserve muP-style learning-rate transfer at initialization. Empirically, OrScale ranks first on CIFAR-10/DavidNet across three seeds, improving Muon from 93.70% to 94.05% validation top-1, and OrScale-LM improves FineWeb-Edu pre-training versus Muon+Moonlight at three of four scales from 125M to 1.1B parameters while outperforming AdamW at every scale.
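The rule described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the exact form of the coupled weight decay (added to the orthogonalized direction before the norm is taken), the numerator (the Frobenius norm of the weights, as in LAMB), and the calibration by dividing by the initial ratio are all assumptions made for illustration. Moonlight shape scaling is omitted, and an exact SVD stands in for the Newton-Schulz iteration that Muon uses to approximate orthogonalization.

```python
import numpy as np

def orthogonalize(m):
    """Semi-orthogonal factor of m via SVD.

    Muon approximates this with a Newton-Schulz iteration; the exact SVD
    is used here only to keep the sketch short.
    """
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt

def trust_ratio_step(w, momentum, lr=0.02, weight_decay=0.01, calib=1.0, eps=1e-12):
    """One hypothetical trust-ratio step for a single matrix layer.

    direction = orth(momentum) + weight_decay * w   (assumed coupled decay)
    raw_ratio = ||w||_F / ||direction||_F           (denominator: applied direction)
    The effective ratio is raw_ratio / calib.
    """
    direction = orthogonalize(momentum) + weight_decay * w
    raw_ratio = np.linalg.norm(w) / (np.linalg.norm(direction) + eps)
    w_next = w - lr * (raw_ratio / calib) * direction
    return w_next, raw_ratio

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 32)) * 0.1   # toy layer weights
g = rng.standard_normal((64, 32))         # stand-in for the momentum buffer

# Assumed one-time calibration: record the initial raw ratio so that the
# effective ratio of this layer starts at one.
_, calib = trust_ratio_step(w, g)
w_new, raw = trust_ratio_step(w, g, calib=calib)
effective_ratio = raw / calib

# The orthogonalized update alone has a Frobenius norm fixed by the layer
# shape, sqrt(min(rows, cols)), regardless of the gradient values.
orth_norm = np.linalg.norm(orthogonalize(g))
```

In this sketch the norm of the orthogonalized update depends only on the layer shape, which is one way to read the abstract's remark about shape-degenerate denominators. Including the weight-decay term in the direction makes the denominator reflect what is applied to the weights.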

