LG · AI · Feb 9

Grokking in Linear Models for Logistic Regression

arXiv:2602.08302v1
Originality: Incremental advance
AI Analysis

This work advances the basic understanding of grokking for machine learning researchers by showing that it can occur in linear models. The advance is incremental: it extends prior studies, which focused on deep learning, to a simpler setting.

The paper studies grokking, or delayed generalization, in a simple linear model trained with logistic loss for binary classification on linearly separable data. It shows that grokking can emerge from gradient descent dynamics alone, without depth or representation learning, and supports this with theoretical and experimental validation of a three-phase learning process and a characterization of the grokking time.

Grokking, the phenomenon of delayed generalization, is often attributed to the depth and compositional structure of deep neural networks. We study grokking in one of the simplest possible settings: the learning of a linear model with logistic loss for binary classification on data that are linearly (and max margin) separable about the origin. We investigate three testing regimes: (1) test data drawn from the same distribution as the training data, in which case grokking is not observed; (2) test data concentrated around the margin, in which case grokking is observed; and (3) adversarial test data generated via projected gradient descent (PGD) attacks, in which case grokking is also observed. We theoretically show that the implicit bias of gradient descent induces a three-phase learning process (population-dominated, support-vector-dominated unlearning, and support-vector-dominated generalization) during which delayed generalization can arise. Our analysis further relates the emergence of grokking to asymmetries in the data, both in the number of examples per class and in the distribution of support vectors across classes, and yields a characterization of the grokking time. We experimentally validate our theory by planting different distributions of population points and support vectors, and by analyzing accuracy curves and hyperplane dynamics. Overall, our results demonstrate that grokking does not require depth or representation learning, and can emerge even in linear models through the dynamics of the bias term.
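The setting described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' experimental code: the data generator `make_data`, the class sizes (200 versus 20), the margin of 0.5, the learning rate, and the step count are all hypothetical choices made here, and whether a delayed rise in margin-test accuracy appears will depend on them. The sketch trains a linear model with a bias term by full-batch gradient descent on the logistic loss and logs training accuracy, accuracy on test points concentrated near the margin (regime 2), and the bias.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n_pos, n_neg, margin, spread, rng):
    """Hypothetical generator: labels follow the sign of x1, and every point
    lies at least `margin` away from the separating line x1 = 0."""
    def side(n, sign):
        x1 = sign * (margin + spread * rng.random(n))
        x2 = rng.normal(size=n)
        return np.column_stack([x1, x2])
    X = np.vstack([side(n_pos, 1.0), side(n_neg, -1.0)])
    y = np.concatenate([np.ones(n_pos), -np.ones(n_neg)])
    return X, y

def accuracy(w, b, X, y):
    pred = np.where(X @ w + b >= 0, 1.0, -1.0)
    return float(np.mean(pred == y))

def train(X, y, X_test, y_test, lr=0.1, steps=2000, log_every=100):
    """Full-batch gradient descent on the mean logistic loss log(1 + exp(-y (w.x + b)))."""
    w = np.zeros(X.shape[1])
    b = 0.0
    history = []  # entries: (step, train accuracy, margin-test accuracy, bias)
    for t in range(steps + 1):
        if t % log_every == 0:
            history.append((t, accuracy(w, b, X, y), accuracy(w, b, X_test, y_test), b))
        z = y * (X @ w + b)
        s = np.exp(-np.logaddexp(0.0, z))  # sigmoid(-z), computed stably
        g = -y * s                         # per-example d(loss)/d(score)
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b, history

# Assumed setup: imbalanced training classes, balanced test points close to the margin.
X_train, y_train = make_data(200, 20, margin=0.5, spread=1.0, rng=rng)
X_test, y_test = make_data(500, 500, margin=0.5, spread=0.05, rng=rng)

w, b, history = train(X_train, y_train, X_test, y_test)
for t, tr, te, bias in history:
    print(f"{t:5d}  train={tr:.3f}  margin_test={te:.3f}  bias={bias:+.3f}")
```

Plotting the logged training and margin-test accuracies against the step count, alongside the bias, gives the kind of accuracy curve and hyperplane trajectory the abstract refers to. Varying the class imbalance is one way to probe the asymmetry the analysis points to.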

