LG CL May 21, 2025

Mechanistic Insights into Grokking from the Embedding Layer

H. V. AlquBoj, Hilal AlQuabeh, Velibor Bojkovic, Munachiso Nwadike, Kentaro Inui

arXiv:2505.15624v19.44 citationsh-index: 36

Originality Highly original

AI Analysis

This work examines the underexplored mechanisms behind grokking, offering researchers in neural network training insights into embedding dynamics and optimization, though it builds incrementally on prior observations of grokking.

The paper addresses grokking, the delayed generalization of neural networks, by showing that embeddings are central to the phenomenon in modular arithmetic tasks: MLPs without embeddings can generalize immediately. It also proves that an adaptive learning rate ratio mitigates bilinear coupling effects and accelerates convergence, and it argues that these methods extend to Transformer optimization challenges.

Grokking, a delayed generalization in neural networks after perfect training performance, has been observed in Transformers and MLPs, but the components driving it remain underexplored. We show that embeddings are central to grokking: introducing them into MLPs induces delayed generalization in modular arithmetic tasks, whereas MLPs without embeddings can generalize immediately. Our analysis identifies two key mechanisms: (1) Embedding update dynamics, where rare tokens stagnate due to sparse gradient updates and weight decay, and (2) Bilinear coupling, where the interaction between embeddings and downstream weights introduces saddle points and increases sensitivity to initialization. To confirm these mechanisms, we investigate frequency-aware sampling, which balances token updates by minimizing gradient variance, and embedding-specific learning rates, derived from the asymmetric curvature of the bilinear loss landscape. We prove that an adaptive learning rate ratio, \(\frac{\eta_E}{\eta_W} \propto \frac{\sigma_{\max}(E)}{\sigma_{\max}(W)} \cdot \frac{f_W}{f_E}\), mitigates bilinear coupling effects, accelerating convergence. Our methods not only improve grokking dynamics but also extend to broader challenges in Transformer optimization, where bilinear interactions hinder efficient training.
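As a rough illustration of the learning rate ratio stated in the abstract, the sketch below evaluates it for two small matrices. It is not the authors' implementation. The abstract does not define \(f_E\) and \(f_W\), so the code treats them as hypothetical update frequencies supplied by the caller, and the proportionality constant `c` and the function name `lr_ratio` are likewise assumptions made for this example.

```python
import numpy as np


def lr_ratio(E, W, f_E, f_W, c=1.0):
    """Return eta_E / eta_W = c * sigma_max(E) / sigma_max(W) * f_W / f_E.

    E, W     : embedding matrix and downstream weight matrix (2-D arrays)
    f_E, f_W : assumed update frequencies (not defined in the abstract)
    c        : assumed proportionality constant (the abstract only gives "∝")
    """
    sigma_E = np.linalg.norm(E, 2)  # spectral norm = largest singular value
    sigma_W = np.linalg.norm(W, 2)
    return c * (sigma_E / sigma_W) * (f_W / f_E)


# Toy matrices with known largest singular values (2 and 4).
E = np.diag([2.0, 1.0])
W = np.diag([4.0, 0.5])

# Hypothetical frequencies: embeddings updated half as often as W.
ratio = lr_ratio(E, W, f_E=0.5, f_W=1.0)

eta_W = 1e-3             # base learning rate for downstream weights
eta_E = ratio * eta_W    # embedding learning rate implied by the ratio
print(ratio, eta_E)
```

In this toy case the spectral-norm factor (2/4) and the frequency factor (1.0/0.5) cancel, so the embedding and downstream learning rates coincide. In a training loop the ratio would presumably be recomputed as the matrices change, with the two rates applied through separate optimizer parameter groups.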
