LG · ML · Jun 16, 2025

What Happens During the Loss Plateau? Understanding Abrupt Learning in Transformers

arXiv:2506.13688v2 · 4 citations · h-index: 4
Originality: Incremental advance
AI Analysis

The paper addresses a question relevant to researchers and practitioners: understanding, and potentially improving, the training dynamics of Transformers. It is rated incremental because it builds on earlier observations of abrupt learning.

The paper investigates why Transformers trained on algorithmic tasks learn abruptly, with a long loss plateau followed by a sharp improvement. It finds that during the plateau the model develops a partial solution while exhibiting repetition bias and representation collapse, and that slow learning of attention maps is a key bottleneck. It also shows that repetition bias and representation collapse appear in the early pre-training of large language models such as Pythia and OLMo.

Training Transformers on algorithmic tasks frequently demonstrates an intriguing abrupt learning phenomenon: an extended performance plateau followed by a sudden, sharp improvement. This work investigates the underlying mechanisms for such dynamics, primarily in shallow Transformers. We reveal that during the plateau, the model often develops an interpretable partial solution while simultaneously exhibiting a strong repetition bias in its outputs. This output degeneracy is accompanied by internal representation collapse, where hidden states across different tokens become nearly parallel. We further identify the slow learning of optimal attention maps as a key bottleneck. Hidden progress in attention configuration during the plateau precedes the eventual rapid convergence, and directly intervening on attention significantly alters plateau duration and the severity of repetition bias and representation collapse. We validate that these identified phenomena (repetition bias and representation collapse) are not artifacts of toy setups but also manifest in the early pre-training stage of large language models like Pythia and OLMo.
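The two diagnostics named in the abstract can be made concrete with a small sketch. The code below is an illustration under stated assumptions, not the authors' implementation: the function names `mean_pairwise_cosine` and `repetition_frequency` are hypothetical, "nearly parallel" is read as a mean pairwise cosine similarity close to 1, and repetition bias is approximated as the fraction of output tokens that repeat the previous token. The paper's exact definitions may differ.

```python
import math
from itertools import combinations


def mean_pairwise_cosine(hidden_states):
    """Average cosine similarity over all pairs of distinct token positions.

    hidden_states: list of equal-length vectors, one per token position.
    Values close to 1 indicate nearly parallel hidden states, which is one
    possible reading of "representation collapse" (an assumption here).
    """
    unit_vectors = []
    for vec in hidden_states:
        norm = math.sqrt(sum(x * x for x in vec))
        if norm == 0.0:
            raise ValueError("zero vector has no direction")
        unit_vectors.append([x / norm for x in vec])

    pairs = list(combinations(unit_vectors, 2))
    if not pairs:
        raise ValueError("need at least two hidden states")

    total = sum(sum(a * b for a, b in zip(u, v)) for u, v in pairs)
    return total / len(pairs)


def repetition_frequency(tokens):
    """Fraction of positions whose token equals the immediately preceding one.

    A simple proxy for repetition bias in a model's output sequence
    (an assumed metric, chosen for illustration).
    """
    if len(tokens) < 2:
        return 0.0
    repeats = sum(1 for prev, cur in zip(tokens, tokens[1:]) if prev == cur)
    return repeats / (len(tokens) - 1)


if __name__ == "__main__":
    # Made-up vectors pointing in almost the same direction.
    collapsed = [[1.0, 2.0], [2.0, 4.1], [0.5, 1.0]]
    # Made-up vectors pointing in clearly different directions.
    spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    print(mean_pairwise_cosine(collapsed))
    print(mean_pairwise_cosine(spread))
    # Three of the five adjacent pairs repeat, so this prints 0.6.
    print(repetition_frequency([3, 3, 3, 7, 7, 1]))
```

Tracking both quantities across training checkpoints would be one way to look for the pattern the abstract describes: high values during the plateau that fall around the point where the loss drops.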

Code Implementations: 1 repo