CL | AI | LG | Dec 19, 2022

Training Trajectories of Language Models Across Scales

CMU, Princeton, UW
arXiv:2212.09803v3 | 257 citations | h-index: 116
Originality: Synthesis-oriented
AI Analysis

This paper gives researchers insight into how language models scale, but its contribution is incremental: it builds on existing OPT models and focuses on analysis rather than new methods.

The paper analyzes the training dynamics of OPT models from 125M to 175B parameters and finds that perplexity is a stronger predictor of model behavior than model size or training computation. It also identifies specific patterns in which tokens see their loss reduced and in how models learn grammatical sequences that contain hallucinations.

Scaling up language models has led to unprecedented performance gains, but little is understood about how the training dynamics change as models get larger. How do language models of different sizes learn during pre-training? Why do larger language models demonstrate more desirable behaviors? In this paper, we analyze the intermediate training checkpoints of differently sized OPT models (Zhang et al., 2022), from 125M to 175B parameters, on next-token prediction, sequence-level generation, and downstream tasks. We find that 1) at a given perplexity and independent of model size, a similar subset of training tokens sees the most significant reduction in loss, with the rest stagnating or showing double-descent behavior; 2) early in training, all models learn to reduce the perplexity of grammatical sequences that contain hallucinations, with small models halting at this suboptimal distribution and larger ones eventually learning to assign these sequences lower probabilities; 3) perplexity is a strong predictor of in-context learning performance on 74 multiple-choice tasks from BIG-Bench, and this holds independent of model size. Together, these results show that perplexity is more predictive of model behaviors than model size or training computation.
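
Since the abstract's findings are all stated in terms of perplexity, a short illustration of the standard definition may help: perplexity is the exponential of the mean negative log-likelihood per token. The sketch below is not taken from the paper's code. The function name `perplexity` and the log-probability values for the two checkpoints are made up for illustration only.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) for natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical probabilities a model assigns to the same 4-token sequence
# at an early and a later training checkpoint (illustrative values only).
early_checkpoint = [math.log(p) for p in (0.10, 0.20, 0.05, 0.10)]
later_checkpoint = [math.log(p) for p in (0.50, 0.40, 0.25, 0.50)]

ppl_early = perplexity(early_checkpoint)
ppl_later = perplexity(later_checkpoint)
print(f"early checkpoint perplexity: {ppl_early:.2f}")
print(f"later checkpoint perplexity: {ppl_later:.2f}")
```

Comparing checkpoints "at a given perplexity", as the abstract does, means aligning models by this quantity on held-out text rather than by parameter count or training steps.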

Code Implementations: 1 repo
