LG | Jul 26, 2025

What Can Grokking Teach Us About Learning Under Nonstationarity?

DeepMind
arXiv:2507.20057v1 | 6 citations | h-index: 75
Originality: Incremental advance
AI Analysis

The paper addresses primacy bias in continual learning, but the advance is incremental because it builds on known grokking dynamics.

The paper tackles the problem of neural networks' primacy bias in continual learning by proposing that feature-learning dynamics from grokking can help overwrite old features, and demonstrates that increasing the effective learning rate improves generalization in grokking, warm-starting, and reinforcement learning tasks.

In continual learning problems, it is often necessary to overwrite components of a neural network's learned representation in response to changes in the data stream; however, neural networks often exhibit primacy bias, whereby early training data hinders the network's ability to generalize on later tasks. While feature-learning dynamics of nonstationary learning problems are not well studied, the emergence of feature-learning dynamics is known to drive the phenomenon of grokking, wherein neural networks initially memorize their training data and only later exhibit perfect generalization. This work conjectures that the same feature-learning dynamics which facilitate generalization in grokking also underlie the ability to overwrite previously learned features, and that methods which accelerate grokking by facilitating feature-learning dynamics are promising candidates for addressing primacy bias in nonstationary learning problems. We then propose a straightforward method to induce feature-learning dynamics as needed throughout training by increasing the effective learning rate, i.e., the ratio between parameter and update norms. We show that this approach both facilitates feature learning and improves generalization in a variety of settings, including grokking, warm-starting neural network training, and reinforcement learning tasks.
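
To make the notion of an effective learning rate concrete, here is a minimal sketch. It is not the paper's implementation, since the abstract does not say how the ratio is raised. The sketch assumes the ratio is taken as update norm divided by parameter norm, and it assumes (hypothetically) that the parameters can be rescaled to a smaller norm without changing the network's output, as is the case for scale-invariant layers. The helper names `effective_learning_rate` and `rescale_to_norm` are illustrative, not taken from the paper.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def effective_learning_rate(params, update):
    """Assumed convention: size of the update relative to the parameters."""
    return l2_norm(update) / l2_norm(params)


def rescale_to_norm(params, target_norm):
    """Hypothetical intervention: shrink (or grow) params to a target norm."""
    scale = target_norm / l2_norm(params)
    return [p * scale for p in params]


# Toy values: a parameter vector of norm 5 and an update of norm 0.05.
params = [3.0, 4.0]
update = [0.03, 0.04]

elr_before = effective_learning_rate(params, update)

# Same update applied to parameters rescaled to unit norm:
# the relative change per step is now five times larger.
shrunk = rescale_to_norm(params, 1.0)
elr_after = effective_learning_rate(shrunk, update)

print(f"effective LR before rescaling: {elr_before:.4f}")
print(f"effective LR after rescaling:  {elr_after:.4f}")
```

In this toy setting the same raw update moves the rescaled parameters proportionally further, which is the sense in which a smaller parameter norm (or a larger update norm) raises the effective learning rate.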

