LG · AI · ML · May 28

On the Optimizer Dependence of Neural Scaling Laws

arXiv:2605.29387 · 36.0 · h-index: 2
Predicted impact: top 67% in LG · last 90 days · Originality: Incremental advance
AI Analysis

For researchers using scaling laws to predict model performance, this work shows that optimizer choice systematically affects the scaling exponent, challenging the common assumption of a fixed constant.

The paper demonstrates that the scaling exponent in neural scaling laws depends on the optimizer, with preconditioned optimizers yielding steeper scaling (e.g., natural gradient achieving α≈0.31 vs. gradient descent's α≈0.12 in random-feature regression). This implies optimizer choice should be accounted for in scaling-law forecasts.

The scaling exponent $\alpha$ in neural scaling laws $L(N) \propto N^{-\alpha}$ is commonly treated as a fixed constant set by architecture and data. We present evidence that $\alpha$ depends systematically on the optimizer. In controlled random-feature regression experiments -- the canonical theoretical framework for neural scaling -- we measure $\alpha$ across five optimizer variants and six spectral conditions. Preconditioned optimizers consistently yield steeper scaling (larger $\alpha$), with the $\alpha$-shift increasing across most of the tested spectral range, peaking near $s = 1.5$, and remaining large at $s = 2.0$. At $s \approx 1.0$ (characteristic of natural language), the full natural gradient achieves $\alpha \approx 0.31$ versus $\alpha \approx 0.12$ for gradient descent -- a $2.6\times$ larger fitted exponent that, within the random-feature model, compounds with each model-size doubling. Whether and how this exponent shift transfers to large-scale LLM training -- where recent evidence suggests the advantage may attenuate with scale -- remains an important open question. Our results imply that scaling-law forecasts should account for optimizer choice, and we provide a spectral diagnostic predicting when advanced optimizers will pay off.
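To make the "compounds with each model-size doubling" remark concrete, the sketch below works through what the two quoted exponents imply under a pure power law L(N) = C · N^(-α), and shows one common way to fit α from (N, L) pairs by least squares in log-log space. It is an illustration only and not the paper's code: the function names, the model sizes, and the noise-free synthetic losses are assumptions introduced here, and the fitting procedure the authors actually used is not stated in the abstract.

```python
import math


def loss_factor_per_doubling(alpha: float) -> float:
    """Factor by which L(N) = C * N**(-alpha) shrinks each time N doubles."""
    return 2.0 ** (-alpha)


def fit_exponent(sizes, losses):
    """Least-squares slope of log L against log N; returns alpha = -slope."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


# Exponents quoted in the abstract at s ≈ 1.0.
ALPHA_GD = 0.12  # gradient descent
ALPHA_NG = 0.31  # full natural gradient

# Hypothetical model sizes and noise-free synthetic losses (assumption:
# an exact power law with C = 1, used only to exercise the fit).
sizes = [2 ** k for k in range(6, 14)]
losses_gd = [n ** (-ALPHA_GD) for n in sizes]
losses_ng = [n ** (-ALPHA_NG) for n in sizes]

if __name__ == "__main__":
    for name, alpha, losses in [
        ("gradient descent", ALPHA_GD, losses_gd),
        ("natural gradient", ALPHA_NG, losses_ng),
    ]:
        fitted = fit_exponent(sizes, losses)
        factor = loss_factor_per_doubling(alpha)
        print(f"{name}: fitted alpha = {fitted:.3f}, "
              f"loss factor per doubling = {factor:.3f}")
```

Under these assumptions, each doubling of N multiplies the loss by 2^(-α): roughly an 8% reduction per doubling at α ≈ 0.12 versus roughly 19% at α ≈ 0.31. This is why a 2.6× gap in the exponent widens into a large loss gap over many doublings within the random-feature model.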
