LG · AI · Apr 13

Spectral Entropy Collapse as an Empirical Signature of Delayed Generalisation in Grokking

Truong Xuan Khanh, Truong Quynh Hoa, Luu Duc Trung, Phan Thanh Duc

arXiv:2604.13123 · 13.71 citations · h-index: 1

Predicted impact: top 53% in LG · last 90 days · Originality: Highly original

AI Analysis

Provides a predictive mechanistic signature for grokking in transformers, addressing a key open problem in understanding delayed generalization.

The paper identifies the normalized spectral entropy of the representation covariance as a scalar order parameter that predicts grokking in transformers. The entropy crosses a threshold of ~0.61 about 1,020 steps before generalization, and a causal intervention that prevents the collapse delays grokking by ~5,020 steps.

Grokking -- delayed generalisation long after memorisation -- lacks a predictive mechanistic explanation. We identify the normalised spectral entropy $\tilde{H}(t)$ of the representation covariance as a scalar order parameter for this transition, validated on 1-layer Transformers on group-theoretic tasks. Five contributions: (i) Grokking follows a two-phase pattern: norm expansion then entropy collapse. (ii) $\tilde{H}$ crosses a stable threshold $\tilde{H}^* \approx 0.61$ before generalisation in 100% of runs (mean lead: 1,020 steps). (iii) A causal intervention preventing collapse delays grokking by +5,020 steps ($p=0.044$); a norm-matched control ($n=30$, $p=5\times10^{-5}$) confirms entropy -- not norm -- drives the transition. (iv) A power-law $\Delta T = C_1(\tilde{H}-\tilde{H}^*)^{\gamma}+C_2$ ($R^2=0.543$) predicts grokking onset with 4.1% error. (v) The mechanism holds across abelian ($\mathbb{Z}/97\mathbb{Z}$) and non-abelian ($S_5$) groups. Crucially, MLPs show entropy collapse without grokking, proving collapse is necessary but not sufficient -- architecture matters. Code: https://anonymous.4open.science/r/grokking-entropy
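The abstract does not spell out how $\tilde{H}$ is computed, so the following is only a minimal sketch under a common assumption: take the eigenvalues of the covariance of a batch of hidden representations, normalise them to a probability distribution, and divide the Shannon entropy by $\log d$ so the result lies in $[0, 1]$. The function names `normalised_spectral_entropy` and `first_crossing`, the `eps` cutoff, and the use of NumPy are illustrative choices, not taken from the paper or its code release.

```python
import numpy as np


def normalised_spectral_entropy(reps: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy of the covariance eigenvalue spectrum, scaled to [0, 1].

    reps: array of shape (n_samples, d) holding hidden representations.
    Assumed definition (not confirmed by the source):
        p_i = lambda_i / sum_j lambda_j,  H = -sum_i p_i log p_i,  H_tilde = H / log d
    """
    d = reps.shape[1]
    if d < 2:
        return 0.0  # entropy is undefined for a single dimension; treat as collapsed

    # Sample covariance of the centred representations.
    centred = reps - reps.mean(axis=0, keepdims=True)
    cov = centred.T @ centred / max(reps.shape[0] - 1, 1)

    # eigvalsh handles symmetric matrices; clip tiny negative values from round-off.
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    total = eigvals.sum()
    if total <= eps:
        return 0.0  # all representations identical: no spectrum to speak of

    p = eigvals / total
    nonzero = p[p > eps]  # 0 * log 0 is taken as 0
    entropy = -np.sum(nonzero * np.log(nonzero))
    return float(entropy / np.log(d))


def first_crossing(values, threshold):
    """Index of the first entry at or below `threshold`, or None if never reached."""
    for i, v in enumerate(values):
        if v <= threshold:
            return i
    return None
```

Under this assumed definition, a spectrum with equal eigenvalues gives a value of 1 and a rank-one spectrum gives a value near 0. Applying `first_crossing` to a logged series of entropy values with a threshold of 0.61 would give the checkpoint index at which the reported threshold is first reached, which could then be compared against the checkpoint at which test accuracy rises.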
