LG · Apr 8

The Lifecycle of the Spectral Edge: From Gradient Learning to Weight-Decay Compression

arXiv:2604.07380 · 50.11 citations

Predicted impact: top 50% in LG · last 90 days · Originality: Incremental advance

AI Analysis

This work provides insights into the mechanisms of grokking in neural networks, which is an incremental advancement for understanding training dynamics in machine learning.

The authors investigate the role of the spectral edge in grokking by decomposing it into gradient and weight-decay components on two sequence tasks (Dyck-1 and SCAN). They find a two-phase lifecycle: before grokking the edge is a gradient-driven, functionally active direction; at grokking it becomes a compression axis that is perturbation-flat yet ablation-critical (>4000x more impactful than random directions). They identify three universality classes (functional, mixed, compression) predicted by the gap flow equation, and show that information is re-encoded rather than lost: nonlinear probes recover it where linear probes fall short (MLP R²=0.99 vs. linear R²=0.86).

We decompose the spectral edge -- the dominant direction of the Gram matrix of parameter updates -- into its gradient and weight-decay components during grokking in two sequence tasks (Dyck-1 and SCAN). We find a sharp two-phase lifecycle: before grokking the edge is gradient-driven and functionally active; at grokking, gradient and weight decay align, and the edge becomes a compression axis that is perturbation-flat yet ablation-critical (>4000x more impactful than random directions). Three universality classes emerge (functional, mixed, compression), predicted by the gap flow equation. Nonlinear probes show information is re-encoded, not lost (MLP $R^2=0.99$ where linear $R^2=0.86$), and removing weight decay post-grok reverses compression while preserving the algorithm.
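To make the definition in the abstract concrete, here is a minimal NumPy sketch of extracting the dominant direction of the Gram matrix of a window of parameter updates and splitting it into gradient and weight-decay parts. It is an illustration only, not the authors' procedure: it assumes plain SGD with coupled weight decay (update = -lr * (gradient + weight_decay * parameters)), and the function names `spectral_edge` and `decompose_edge`, the windowing, and the alignment measure are all hypothetical choices.

```python
import numpy as np


def spectral_edge(updates):
    """Dominant direction of the Gram matrix of a window of updates.

    updates: array of shape (T, P), one flattened parameter update per row.
    Returns (unit direction in parameter space, top eigenvalue, gap to the
    second eigenvalue). Requires T >= 2.
    """
    gram = updates @ updates.T                 # (T, T) Gram matrix of updates
    eigvals, eigvecs = np.linalg.eigh(gram)    # eigenvalues in ascending order
    top_coeffs = eigvecs[:, -1]                # mixing weights over the T steps
    direction = updates.T @ top_coeffs         # map back to parameter space
    direction = direction / np.linalg.norm(direction)
    return direction, eigvals[-1], eigvals[-1] - eigvals[-2]


def decompose_edge(grads, params, lr, weight_decay):
    """Split each update's projection onto the edge into two components.

    Assumption (not from the source): SGD with coupled weight decay, so each
    update is -lr * grad - lr * weight_decay * param.
    """
    grad_part = -lr * grads
    decay_part = -lr * weight_decay * params
    direction, _, _ = spectral_edge(grad_part + decay_part)

    grad_proj = grad_part @ direction          # (T,) gradient contribution
    decay_proj = decay_part @ direction        # (T,) weight-decay contribution

    # Cosine similarity between the two projection series: a hypothetical
    # proxy for "gradient and weight decay align" along the edge.
    denom = np.linalg.norm(grad_proj) * np.linalg.norm(decay_proj)
    alignment = float(grad_proj @ decay_proj / denom) if denom > 0 else 0.0
    return direction, grad_proj, decay_proj, alignment
```

Working with the T x T Gram matrix instead of the P x P covariance keeps the eigendecomposition cheap when the window is much shorter than the parameter count. The sign of the returned direction is arbitrary, but the alignment score is unaffected because both projection series flip together.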
