LG · May 29

A Pre-Training Analogue of Grokking in Language Models: Tracing Delayed Grammatical Generalization

Sherin Muckatira, Namrata Shivagunde, Vijeta Deshpande, Anna Rumshisky

arXiv:2606.00230 · 76.7 · h-index: 5

Predicted impact: top 18% in LG · last 90 days
Originality: Incremental advance

AI Analysis

For researchers studying neural network generalization and language model training dynamics, this work provides a method to analyze delayed generalization in LLM pre-training, though it is an incremental extension of grokking concepts to a new setting.

The paper introduces an exposure-based framework to study grokking-like delayed generalization in LLM pre-training, demonstrating across five grammatical phenomena that generalization occurs after initial fitting, with grammatical concept vectors becoming more predictive and higher-dimensional post-generalization.

Grokking, the phenomenon in which neural networks generalize long after fitting their training data, has been studied in supervised settings over many epochs. LLM pre-training instead involves next-token prediction over an unlabeled corpus, with limited data repetition and no explicit train/validation split. To address this, we propose an exposure-based framework that enables the study of grokking-like dynamics during LLM pre-training. We ground our evaluation in BLiMP minimal pairs, which provide controlled grammatical contrasts. For every BLiMP minimal pair, we identify a critical phrase, the smallest contiguous span that captures the grammatical contrast and the phenomenon-relevant context. Examples whose critical phrase appears in the pre-training window are assigned to the proxy-train split; the remaining examples are assigned to the proxy-validation split. Across five grammatical phenomena, we observe delayed generalization. Analyzing pre-training checkpoints before and after generalization shows that grammatical concept vectors become more predictive of grammatical acceptability and occupy a higher-dimensional subspace after generalization. We also find that attention from the critical token to the relevant context token is concentrated in a small number of heads.
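
The exposure-based split described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `split_by_exposure`, the example IDs, the lowercasing, and the whitespace-based phrase matching are all assumptions made for illustration. The abstract does not say how critical phrases are matched against the pre-training window.

```python
from typing import Dict, Iterable, List


def _normalize(text: str) -> str:
    # Assumption: matching ignores case and collapses runs of whitespace.
    return " ".join(text.lower().split())


def split_by_exposure(
    critical_phrases: Dict[str, str],
    window_texts: Iterable[str],
) -> Dict[str, List[str]]:
    """Assign each example to proxy-train if its critical phrase occurs
    in the pre-training window, otherwise to proxy-validation."""
    # Pad with spaces so a phrase only matches on whitespace-delimited
    # word boundaries (punctuation stays attached to its word).
    corpus = [f" {_normalize(doc)} " for doc in window_texts]
    splits: Dict[str, List[str]] = {"proxy_train": [], "proxy_validation": []}
    for example_id, phrase in critical_phrases.items():
        needle = f" {_normalize(phrase)} "
        seen = any(needle in doc for doc in corpus)
        splits["proxy_train" if seen else "proxy_validation"].append(example_id)
    return splits


# Hypothetical toy data: two documents standing in for the pre-training
# window, and three made-up critical phrases keyed by example ID.
window = ["The cats sleep on the mat.", "A dog barks loudly."]
phrases = {"ex1": "cats sleep", "ex2": "authors write", "ex3": "dog barks"}
print(split_by_exposure(phrases, window))
```

In this toy run, `ex1` and `ex3` land in the proxy-train split because their phrases occur in the window, while `ex2` lands in proxy-validation. A real pipeline would likely match on the model's tokenizer output and stream over the corpus rather than hold it in memory.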
