Grokking as Dimensional Phase Transition in Neural Networks

arXiv:2604.04655 · 6.72 citations

Predicted impact top 63% in LG · last 90 days

Originality: Highly original

AI Analysis

The paper offers new insight into the trainability of overparameterized networks, a fundamental open question in the study of learning dynamics.

The paper studies grokking, the abrupt memorization-to-generalization transition in neural networks, through finite-size scaling of gradient avalanche dynamics across eight model scales. It finds that grokking is a dimensional phase transition: the effective dimensionality D crosses from sub-diffusive (D < 1) to super-diffusive (D > 1) at generalization onset, and this crossing is robust across graph topologies.

Neural network grokking -- the abrupt memorization-to-generalization transition -- challenges our understanding of learning dynamics. Through finite-size scaling of gradient avalanche dynamics across eight model scales, we find that grokking is a \textit{dimensional phase transition}: effective dimensionality~$D$ crosses from sub-diffusive (subcritical, $D < 1$) to super-diffusive (supercritical, $D > 1$) at generalization onset, exhibiting self-organized criticality (SOC). Crucially, $D$ reflects \textbf{gradient field geometry}, not network architecture: synthetic i.i.d.\ Gaussian gradients maintain $D \approx 1$ regardless of graph topology, while real training exhibits dimensional excess from backpropagation correlations. The grokking-localized $D(t)$ crossing -- robust across topologies -- offers new insight into the trainability of overparameterized networks.
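The abstract does not say how $D$ is computed, so the following is only a minimal sketch of one common way an exponent is read off in finite-size scaling: fit the slope of a log-log plot of an avalanche cutoff against system size. The power-law form, the function name `fit_scaling_exponent`, the system sizes, and the synthetic cutoffs are all assumptions made for illustration; none of them come from the paper.

```python
import numpy as np

def fit_scaling_exponent(sizes, cutoffs):
    """Return the least-squares slope of log(cutoff) against log(size).

    Assumes cutoff ~ c * size**D, so the slope estimates D.
    This is a generic finite-size-scaling fit, not the paper's estimator.
    """
    log_n = np.log(np.asarray(sizes, dtype=float))
    log_s = np.log(np.asarray(cutoffs, dtype=float))
    slope, _intercept = np.polyfit(log_n, log_s, 1)
    return float(slope)

# Hypothetical system sizes: eight scales, chosen only to mirror the
# "eight model scales" mentioned in the abstract.
sizes = np.array([64, 128, 256, 512, 1024, 2048, 4096, 8192])

# Synthetic cutoffs drawn from exact power laws (illustrative, not real data).
subcritical_cutoffs = 3.0 * sizes ** 0.8    # exponent below 1
supercritical_cutoffs = 3.0 * sizes ** 1.2  # exponent above 1

d_sub = fit_scaling_exponent(sizes, subcritical_cutoffs)
d_super = fit_scaling_exponent(sizes, supercritical_cutoffs)

print(f"fitted D (subcritical case):   {d_sub:.3f}")
print(f"fitted D (supercritical case): {d_super:.3f}")
```

Under these assumptions, tracking such a fitted exponent over training time would give a curve $D(t)$, and the abstract's claim corresponds to that curve crossing 1 at generalization onset.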

