LG · Aug 18, 2025

Learning In-context n-grams with Transformers: Sub-n-grams Are Near-stationary Points

arXiv:2508.12837v2 · 11 citations · h-index: 5 · ICML
Originality: Incremental advance
AI Analysis

The paper offers machine learning researchers theoretical insight into training dynamics, but the advance is incremental because it builds on known empirical observations.

The paper investigates the loss landscape of transformer models trained on in-context next-token prediction tasks, establishing that sub-n-grams are near-stationary points of the population cross-entropy loss, which explains observed stage-wise learning dynamics and emergent phase transitions.

Motivated by empirical observations of prolonged plateaus and stage-wise progression during training, we investigate the loss landscape of transformer models trained on in-context next-token prediction tasks. In particular, we focus on learning in-context $n$-gram language models under cross-entropy loss, and establish a sufficient condition for parameter configurations to be stationary points. We then construct a set of parameter configurations for a simplified transformer model that represent $k$-gram estimators (for $k \leq n$), and show that the gradient of the population loss at these solutions vanishes in the limit of infinite sequence length and parameter norm. This reveals a key property of the loss landscape: sub-$n$-grams are near-stationary points of the population cross-entropy loss, offering theoretical insight into widely observed phenomena such as stage-wise learning dynamics and emergent phase transitions. These insights are further supported by numerical experiments that illustrate the learning dynamics of $n$-grams, characterized by discrete transitions between near-stationary solutions.
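
To make the notion of an in-context $k$-gram estimator concrete, here is a minimal Python sketch of a count-based version. It is an illustration, not the paper's transformer construction. Several things are assumptions introduced here and do not come from the abstract: the function names `in_context_kgram` and `cross_entropy`, the convention that a $k$-gram estimator conditions on the last $k-1$ tokens, and the additive `smoothing` term, which is included only to avoid zero probabilities.

```python
from collections import Counter
import math


def in_context_kgram(seq, k, vocab_size, smoothing=1.0):
    """Estimate p(next token | last k-1 tokens) from counts inside `seq` itself.

    Assumed convention: a k-gram estimator conditions on the previous k-1
    tokens, so k=1 is the in-context unigram (empirical token frequencies).
    """
    if k < 1 or k - 1 > len(seq):
        raise ValueError("need 1 <= k <= len(seq) + 1")
    ctx_len = k - 1
    # The context we want to continue: the last k-1 tokens of the sequence.
    query = tuple(seq[len(seq) - ctx_len:]) if ctx_len else ()
    counts = Counter()
    # Count which token followed each earlier occurrence of the same context.
    for t in range(ctx_len, len(seq)):
        if tuple(seq[t - ctx_len:t]) == query:
            counts[seq[t]] += 1
    total = sum(counts.values()) + smoothing * vocab_size
    return [(counts[v] + smoothing) / total for v in range(vocab_size)]


def cross_entropy(probs, target):
    """Cross-entropy loss of a predicted distribution on a single target token."""
    return -math.log(probs[target])


if __name__ == "__main__":
    context = [0, 1, 0, 1, 0, 2, 0]
    for k in (1, 2, 3):
        p = in_context_kgram(context, k, vocab_size=3)
        print(k, [round(x, 3) for x in p], round(cross_entropy(p, 1), 3))
```

On data generated by an $n$-gram model, one would expect the estimators with $k < n$ to give progressively better but still suboptimal losses. These are the sub-$n$-gram solutions that the abstract identifies as near-stationary points.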
