LG · AI · May 22

Feature Lottery? A Bifurcation Theory of Concept Emergence

arXiv:2605.24057 · 29.9

Predicted impact: top 81% in LG · last 90 days
Originality: Highly original

AI Analysis

The paper provides a practical, label-free early-warning indicator for training health in neural networks, detecting structural transitions before downstream metrics react.

The paper introduces a bifurcation theory to detect the onset of structured representations in neural networks in real time, using a label-free phase coordinate β/β_c. It validates four transition regimes across diverse settings and shows that early atom purity in SAE training predicts final interpretability, with top-decile early atoms achieving over 12x baseline purity at convergence.

Neural networks acquire structured representations at specific moments during training, yet identifying these transitions typically relies on retrospective, label-dependent metrics. We introduce a bifurcation theory of representation dynamics to detect these moments in real time. Analyzing a passive GMM probe attached to the evolving encoder, we show the onset of structure corresponds to a supercritical pitchfork bifurcation driven by the loss Hessian. The system exhibits a theoretically predictable zero-crossing ($β_c$) that, compared to the network's current state ($β$), yields a dynamic ratio $β(t)/β_c(t)$: a universal, label-free phase coordinate for representation dynamics, computable entirely from hidden states. We empirically validate four distinct transition regimes predicted by this coordinate across diverse settings: SAEs on language models (Pythia), SSL (CIFAR), and grokking (modular arithmetic). Crucially, under finite dissipation, macroscopic symmetry-breaking can lag the initial zero-crossing by orders of magnitude, providing a rigorous dynamical account of the delayed escape observed in grokking. Microscopically, the bifurcation creates a shared unstable subspace, forcing collective symmetry breaking. We term this the "feature lottery" in SAE training: a feature's terminal interpretability becomes predictable remarkably early. By only 5% of training, early atom purity robustly predicts final convergence purity, with top-decile early atoms achieving over 12x the baseline purity at convergence. Beyond explaining concept emergence, $β/β_c$ provides a practical early-warning indicator for training health, detecting the onset of usable structure, the crystallization of feature identity, and representational collapse epochs before downstream metrics react.
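
For readers unfamiliar with the term, the textbook normal form of a supercritical pitchfork bifurcation is dx/dt = mu*x - x^3: for mu < 0 the symmetric state x = 0 is stable, and for mu > 0 it becomes unstable while two stable branches appear at x = ±sqrt(mu). The sketch below integrates this one-dimensional toy model only; it is not the paper's GMM probe or its Hessian-based analysis. The function names, the step size, and the reading of mu as something like β/β_c - 1 are assumptions made for illustration. It also shows why visible symmetry breaking can lag the zero-crossing: starting from a tiny perturbation, the escape time grows as mu approaches zero from above.

```python
import math


def simulate_pitchfork(mu, x0=1e-6, dt=1e-2, steps=20_000):
    """Euler-integrate dx/dt = mu*x - x**3 and return the final amplitude.

    mu is the control parameter of the textbook normal form. Reading it as
    beta/beta_c - 1 is an illustrative assumption, not the paper's definition.
    """
    x = x0
    for _ in range(steps):
        x += dt * (mu * x - x**3)
    return x


def escape_time(mu, x0=1e-6, dt=1e-2, max_steps=5_000_000):
    """Time until |x| reaches half the equilibrium amplitude sqrt(mu).

    Only meaningful for mu > 0. Returns math.inf if the threshold is not
    reached within max_steps.
    """
    target = 0.5 * math.sqrt(mu)
    x, t = x0, 0.0
    for _ in range(max_steps):
        if abs(x) >= target:
            return t
        x += dt * (mu * x - x**3)
        t += dt
    return math.inf


if __name__ == "__main__":
    # Below the critical point the perturbation decays back to zero.
    print("mu=-0.5 final amplitude:", simulate_pitchfork(-0.5))
    # Above it, the state settles on the symmetry-broken branch sqrt(mu).
    print("mu=+0.5 final amplitude:", simulate_pitchfork(0.5))
    # Closer to the critical point, escape from the symmetric state is slower.
    for mu in (0.1, 0.01):
        print(f"mu={mu} escape time:", escape_time(mu))
```

In this toy model the lag comes from the slow exponential growth rate mu acting on a small initial perturbation; the paper attributes the delay in its setting to finite dissipation, which this sketch does not model.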
