LGMay 22

Spectral Asymptotics of Neural Network Loss Landscapes: An Exact Decomposition of the Curvature Exponent

arXiv:2606.0259618.8

AI Analysis

For deep learning theorists and practitioners, this provides a principled geometric explanation for observed curvature variations and a practical architecture-adaptive optimizer.

The paper proves a Spectral Alignment Decomposition that explains why the curvature exponent α varies across layer types (e.g., α≈2 for convolutions, ≈1 for attention), and derives a spectral transfer identity linking α, gradient rank-decay γ, and Hessian decay exponent s, recovering s to ~2% median error across 93 layers with no free parameters. As a proof of concept, they introduce Spectral Newton, which outperforms AdamW on vision benchmarks where α≈2.

The curvature exponent $α$ in $h_k \propto σ_k^α$ -- governing how Hessian eigenvalues scale with gradient singular values -- varies systematically across layer types ($α\approx 2$ for convolutions, $\approx 1$ for transformer attention, $< 1$ for MLP up-projections). Why? We prove the Spectral Alignment Decomposition: $α= 2 + d\logΦ_k / d\logσ_k$, where $Φ_k$ measures alignment between Kronecker factor eigenbases and gradient singular directions. This reduces "why does $α$ vary?" to a geometric question we answer for LayerNorm, residual connections, and softmax heads. The decomposition implies a spectral transfer identity $s = αγ$ linking curvature exponent, effective gradient rank-decay $γ$, and Hessian decay exponent $s$. The identity is algebraic; its empirical content is that $α$ and $γ$, fit on independent data (HVPs vs. SVD), recover $s$ to ~2% median error across 93 layers, five architectures, and three datasets -- with no free parameters. A zeta-function bound on participation ratio shows curvature concentrates onto effectively one direction per layer. As a proof of concept, we derive the architecture-adaptive preconditioner $T(σ;α)$ and show that Spectral Newton -- implementing $T$ in the gradient singular basis -- outperforms AdamW on vision benchmarks where $α\approx 2$.

View on arXiv PDF

Similar