CL LG | Sep 30, 2025

Convergence and Divergence of Language Models under Different Random Seeds

Finlay Fehlauer, Kyle Mahowald, Tiago Pimentel

arXiv:2509.26643v1 | 2 citations | h-index: 4 | EMNLP

Originality: Incremental advance

AI Analysis

This work addresses the stability of the distributions learned during language model training, an incremental insight for NLP researchers and practitioners.

The paper investigates how language models trained under different random seeds converge. It identifies a four-phase pattern and shows that larger models reconverge faster in later training stages, while smaller models do not reconverge. Convergence is also uneven across linguistic categories: for example, frequent tokens converge faster than infrequent ones.

In this paper, we investigate the convergence of language models (LMs) trained under different random seeds, measuring convergence as the expected per-token Kullback-Leibler (KL) divergence across seeds. By comparing LM convergence as a function of model size and training checkpoint, we identify a four-phase convergence pattern: (i) an initial uniform phase, (ii) a sharp-convergence phase, (iii) a sharp-divergence phase, and (iv) a slow-reconvergence phase. Further, we observe that larger models reconverge faster in later training stages, while smaller models never actually reconverge; these results suggest that a certain model size may be necessary to learn stable distributions. Restricting our analysis to specific token frequencies or part-of-speech (PoS) tags further reveals that convergence is uneven across linguistic categories: frequent tokens and function words converge faster and more reliably than their counterparts (infrequent tokens and content words). Overall, our findings highlight factors that influence the stability of the learned distributions in model training.
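To make the convergence measure in the abstract concrete, the following is a minimal sketch of an expected per-token KL divergence across seeds. It is not the authors' implementation: the function names (`kl_divergence`, `expected_per_token_kl`), the averaging over ordered pairs of distinct seeds, the `eps` clamp, and the toy distributions are all assumptions made for illustration.

```python
import math
from itertools import permutations


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same vocabulary.

    Terms with p_i == 0 contribute nothing; q_i is clamped to eps (an assumption)
    to avoid division by zero.
    """
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)


def expected_per_token_kl(seed_distributions):
    """Average KL over all ordered pairs of distinct seeds and all token positions.

    seed_distributions[s][t] is the next-token distribution that the model
    trained with seed s assigns at token position t (hypothetical layout).
    """
    num_seeds = len(seed_distributions)
    num_tokens = len(seed_distributions[0])
    pairs = list(permutations(range(num_seeds), 2))
    total = 0.0
    for a, b in pairs:
        for t in range(num_tokens):
            total += kl_divergence(seed_distributions[a][t], seed_distributions[b][t])
    return total / (len(pairs) * num_tokens)


# Toy data: 3 seeds, 2 token positions, vocabulary of size 3.
toy = [
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
    [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]],
    [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3]],
]
print(expected_per_token_kl(toy))
```

A value of zero means the seeds produce identical distributions at every position; larger values indicate more disagreement. Evaluating such a quantity at successive training checkpoints would give a curve of the kind the four-phase pattern describes, and restricting the positions to a token-frequency bin or PoS tag would correspond to the category-level analysis.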
