Unifying Grokking and Double Descent
This work aims to unify disparate observations about generalization in deep learning, which may benefit researchers in machine learning theory. It appears incremental, however, because it builds on prior studies of grokking and double descent.
The authors address generalization in deep learning by hypothesizing that grokking and double descent are instances of the same learning dynamics, and they report the first demonstration of model-wise grokking.
A principled understanding of generalization in deep learning may require unifying disparate observations under a single conceptual framework. Previous work has studied \emph{grokking}, a training dynamic in which a sustained period of near-perfect training performance and near-chance test performance is eventually followed by generalization, as well as the superficially similar \emph{double descent}. These topics have so far been studied in isolation. We hypothesize that grokking and double descent can be understood as instances of the same learning dynamics within a framework of pattern learning speeds. We propose that this framework also applies when varying model capacity instead of optimization steps, and provide the first demonstration of model-wise grokking.