Grokking vs. Learning: Same Features, Different Encodings
This work concerns learning dynamics in machine learning and is aimed at researchers studying generalization and model efficiency. It is incremental in that it builds on existing grokking concepts.
The study investigates whether grokking and ordinary learning produce fundamentally different models by comparing their features, compressibility, and learning dynamics. It finds that both learn the same features but differ in encoding efficiency: steady training reaches compression factors of up to 25x relative to the base model, which is 5x the compression achieved in grokking.
Grokking typically achieves a loss similar to that of ordinary, "steady" learning. We ask whether these two learning paths, grokking and steady training, lead to fundamental differences in the learned models. To answer this, we compare the features, compressibility, and learning dynamics of models trained via each path on two tasks. We find that grokked and steadily trained models learn the same features, but that the efficiency with which these features are encoded can differ greatly. In particular, we find a novel "compressive regime" of steady training, absent in grokking, in which a linear trade-off emerges between model loss and compressibility. In this regime, we can achieve compression factors of 25x relative to the base model, and 5x the compression achieved in grokking. We then track how model features and compressibility develop through training. We show that model development in grokking is task-dependent and that peak compressibility is achieved immediately after the grokking plateau. Finally, we introduce novel information-geometric measures which demonstrate that models undergoing grokking follow a straight path in information space.
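The two compression figures quoted above can be related with a line of arithmetic. The sketch below is only an illustration: it assumes both factors are multiplicative size ratios measured against the same base model, which the text does not state explicitly, and the function name `implied_grokking_factor` is hypothetical.

```python
def implied_grokking_factor(steady_vs_base: float, steady_vs_grokking: float) -> float:
    """Compression of the grokked model relative to the base model.

    Assumption (not stated in the source): both inputs are multiplicative
    factors, so steady_vs_base = steady_vs_grokking * grokking_vs_base.
    """
    if steady_vs_grokking <= 0:
        raise ValueError("steady_vs_grokking must be positive")
    return steady_vs_base / steady_vs_grokking


# Figures quoted in the abstract: 25x over the base model, 5x over grokking.
grokking_vs_base = implied_grokking_factor(25.0, 5.0)
print(grokking_vs_base)  # 5.0
```

Under this reading, the grokked model would itself compress by about 5x relative to the base model, with the compressive regime of steady training adding a further 5x on top.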