LG AI · May 23, 2025

Next-token pretraining implies in-context learning

Paul M. Riechers, Henry R. Bigelow, Eric A. Alt, Adam Shai

arXiv:2505.18373v2 · 11.42 citations · h-index: 14

Originality: Incremental advance

AI Analysis

The paper offers a fundamental, architecture-independent explanation for in-context learning, addressing a key open problem in understanding model capabilities, though its contribution is an incremental refinement of existing theories.

The paper argues that in-context learning (ICL) predictably emerges from standard next-token pretraining, not as an exotic property, by establishing foundational principles for in-distribution ICL and showing how models adapt to context from non-ergodic sources. It verifies this with experiments on synthetic datasets, reproducing phenomena like phase transitions and power-law scaling, and demonstrates that in-context performance is mathematically tied to the pretraining task ensemble.

We argue that in-context learning (ICL) predictably arises from standard self-supervised next-token pretraining, rather than being an exotic emergent property. This work establishes the foundational principles of this emergence by focusing on in-distribution ICL, demonstrating how models necessarily adapt to context when trained on token sequences, especially from non-ergodic sources. Our information-theoretic framework precisely predicts these in-distribution ICL dynamics (i.e., context-dependent loss reduction). We verify this with experiments using synthetic datasets of differing types of correlational structure, reproducing characteristic phenomena like phase transitions in training loss for induction head formation and power-law scaling of in-context loss. We further show that a model's in-context performance on any task is mathematically coupled to the ensemble of tasks seen in pretraining, offering a fundamental explanation, grounded in architecture- and modality-independent principles, for such inference-time learning.
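
To make context-dependent loss reduction concrete, here is a minimal sketch that is not taken from the paper. It assumes a toy non-ergodic source: each sequence is generated by one of two biased coins, chosen once at the start and then held fixed. The function name `in_context_loss` and the biases 0.2 and 0.8 are illustrative assumptions. The code computes the exact expected next-token loss of the Bayes-optimal predictor at each context position, which for this source is the conditional entropy of the next token given the tokens seen so far.

```python
import math


def binary_entropy(p):
    """Entropy in nats of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))


def in_context_loss(biases, prior, max_len):
    """Expected next-token loss (nats) of the Bayes-optimal predictor.

    Hypothetical helper for illustration. Entry n of the result is the
    expected loss after n context tokens from a mixture of Bernoulli
    sources with the given biases and prior weights.
    """
    losses = []
    for n in range(max_len):  # n = number of context tokens already seen
        loss = 0.0
        for k in range(n + 1):  # k = number of ones among those n tokens
            # Joint probability of (component, k ones in n tokens).
            joint = [
                w * math.comb(n, k) * b**k * (1 - b) ** (n - k)
                for w, b in zip(prior, biases)
            ]
            p_context = sum(joint)
            if p_context == 0.0:
                continue
            # Posterior-predictive probability that the next token is 1.
            p_next_one = sum(j * b for j, b in zip(joint, biases)) / p_context
            loss += p_context * binary_entropy(p_next_one)
        losses.append(loss)
    return losses


biases, prior = [0.2, 0.8], [0.5, 0.5]
losses = in_context_loss(biases, prior, max_len=20)
# Loss of a predictor that is told which coin generated the sequence.
floor = sum(w * binary_entropy(b) for w, b in zip(prior, biases))

for n in (0, 1, 5, 19):
    print(f"context length {n:2d}: expected loss {losses[n]:.4f} nats")
print(f"known-source floor: {floor:.4f} nats")
```

With no context the predictor can only use the mixture average, so the loss starts at ln 2. As context accumulates, the posterior concentrates on the coin that generated the sequence and the loss falls toward the known-source floor without ever dropping below it. An ergodic source with a single fixed coin would show no such reduction in this setup, which is why the mixture structure matters here.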
