LG · AI · Mar 14

Spectral Edge Dynamics of Training Trajectories: Signal--Noise Geometry Across Scales

arXiv:2603.15678 · 53.13 citations · h-index: 3
Predicted impact: top 46% in LG · last 90 days · Originality: Incremental advance
AI Analysis

The paper provides a method for analyzing and predicting training dynamics in large models. The advance is incremental but useful for researchers in machine learning optimization.

The authors address the low-dimensional structure of transformer training trajectories by introducing Spectral Edge Dynamics (SED), which reveals a universal three-phase pattern (rise, plateau, collapse) in the spectral edge. In companion work, the same spectral geometry predicts generalization 600 to 1,700 steps before it occurs across modular arithmetic, Dyck languages, and the SCAN benchmark.

Despite hundreds of millions of parameters, transformer training trajectories evolve within only a few coherent directions. We introduce \emph{Spectral Edge Dynamics} (SED) to measure this structure: rolling-window SVD of parameter updates reveals a sharp boundary -- the \emph{spectral edge} -- between coherent optimization directions and stochastic noise, identified by the maximum consecutive singular value ratio $σ_k/σ_{k+1}$. Across a 51M-parameter TinyStories model (4~seeds) and GPT-2 124M under a distribution shift, the spectral edge exhibits a universal three-phase pattern (rise, plateau, collapse), signal rank adjusts with task complexity ($k^* = 2$ at 51M, $k^* = 3$ at 124M), and the directional coupling between spectral geometry and validation loss reverses with window size -- a \emph{lag flip} reflecting the timescale of trajectory integration. Johnson--Lindenstrauss projection to $d = 10W$ dimensions (e.g., $d = 100$ for $W = 10$) preserves the spectral gap within 5.7\%, making the framework applicable to models of arbitrary size. In companion work, the same spectral geometry provides early-warning signals of grokking -- predicting generalization 600--1{,}700 steps before it occurs across modular arithmetic, Dyck languages, and the SCAN benchmark.
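The spectral-edge computation described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`spectral_edge`, `rolling_spectral_edge`, `jl_project`, `make_toy_updates`) are hypothetical, the updates are not mean-centered, the Johnson--Lindenstrauss map is taken to be a dense Gaussian matrix, and the toy data (two coherent directions plus small isotropic noise) is synthetic rather than taken from the paper.

```python
import numpy as np

def spectral_edge(window_updates, eps=1e-12):
    """Return (k_star, gap) for one window of flattened parameter updates.

    window_updates: array of shape (W, P), one update per row.
    k_star is the 1-indexed k maximizing sigma_k / sigma_{k+1}.
    """
    s = np.linalg.svd(window_updates, compute_uv=False)  # descending order
    ratios = s[:-1] / np.maximum(s[1:], eps)             # sigma_k / sigma_{k+1}
    k_star = int(np.argmax(ratios)) + 1
    return k_star, float(ratios[k_star - 1])

def rolling_spectral_edge(updates, window):
    """Slide a window of `window` consecutive updates over the trajectory."""
    results = []
    for end in range(window, updates.shape[0] + 1):
        k_star, gap = spectral_edge(updates[end - window:end])
        results.append((end, k_star, gap))
    return results

def jl_project(updates, d, seed=0):
    """Gaussian random projection of each update to d dimensions (assumed JL map)."""
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((updates.shape[1], d)) / np.sqrt(d)
    return updates @ proj

def make_toy_updates(steps=40, n_params=2000, noise=0.01, seed=0):
    """Synthetic updates: two coherent directions plus isotropic noise (illustrative only)."""
    rng = np.random.default_rng(seed)
    basis, _ = np.linalg.qr(rng.standard_normal((n_params, 2)))  # orthonormal columns
    t = np.arange(steps)
    coeffs = np.stack([8.0 * np.cos(0.7 * t), 5.0 * np.sin(0.7 * t)], axis=1)
    return coeffs @ basis.T + noise * rng.standard_normal((steps, n_params))

W = 10
updates = make_toy_updates()
full = rolling_spectral_edge(updates, W)
projected = rolling_spectral_edge(jl_project(updates, d=10 * W), W)
print("full space, first window:", full[0])
print("projected,  first window:", projected[0])
```

Each window costs one SVD of a $W \times d$ matrix after projection, which is why the projected variant does not depend on the full parameter count. The toy example only checks that the detected signal rank survives projection; it makes no claim about the 5.7\% figure reported above.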

