LG MLJul 17, 2024

Why Do You Grok? A Theoretical Analysis of Grokking Modular Addition

Mohamad Amin Mohamadi, Zhiyuan Li, Lei Wu, Danica J. Sutherland

arXiv:2407.12332v120.732 citationsh-index: 7

Originality Highly original

AI Analysis

This addresses the fundamental problem of understanding generalization dynamics in deep learning for researchers, offering a theoretical explanation for an observed but poorly understood phenomenon.

The paper tackles the grokking phenomenon in neural networks, where models generalize long after overfitting, by providing a theoretical analysis for modular addition. It shows that models initially cannot generalize without extensive data but later escape this regime, achieving generalization with fewer points, supported by empirical evidence on Transformers.

We present a theoretical explanation of the ``grokking'' phenomenon, where a model generalizes long after overfitting,for the originally-studied problem of modular addition. First, we show that early in gradient descent, when the ``kernel regime'' approximately holds, no permutation-equivariant model can achieve small population error on modular addition unless it sees at least a constant fraction of all possible data points. Eventually, however, models escape the kernel regime. We show that two-layer quadratic networks that achieve zero training loss with bounded $\ell_{\infty}$ norm generalize well with substantially fewer training points, and further show such networks exist and can be found by gradient descent with small $\ell_{\infty}$ regularization. We further provide empirical evidence that these networks as well as simple Transformers, leave the kernel regime only after initially overfitting. Taken together, our results strongly support the case for grokking as a consequence of the transition from kernel-like behavior to limiting behavior of gradient descent on deep networks.

View on arXiv PDF

Similar