LG · AI · CL · Sep 29, 2025

Scaling with Collapse: Efficient and Predictable Training of LLM Families

arXiv:2509.25087v1 · 6 citations · h-index: 18
Originality: Incremental advance
AI Analysis

This work addresses the challenge of efficient and predictable training for LLM developers, offering incremental improvements in optimization diagnostics and tuning.

The paper asks whether training loss curves for large language model (LLM) families collapse onto a universal trajectory under practical scaling recipes. It shows that collapse occurs when hyperparameters are set optimally according to scaling laws, which enables applications such as early diagnosis of training issues and efficient hyperparameter tuning. The authors demonstrate this by training a competitive LLM family, Celerity, using these insights.

Effective LLM training relies on *consistency*, meaning that key quantities -- such as final losses and optimal hyperparameters -- scale predictably across model sizes. Qiu et al. (2025) recently showed that this consistency extends beyond scalars: whole training loss curves can *collapse* onto a universal trajectory after a simple normalization. What remains unclear is whether this phenomenon holds for LLM families trained under *practical scaling recipes*, where width, depth, learning rate, batch size, and weight decay are scaled jointly. We show that it does: loss curves collapse across scales precisely when optimization hyperparameters are set optimally for the given data budget, in accordance with recent empirical scaling laws. Collapse thus emerges as a signature of compute-efficient training. We demonstrate two applications at scale: (1) deviation-from-collapse provides a sensitive, early diagnostic of training pathologies, and (2) the predictability of collapsed curves enables early stopping in large-scale hyperparameter tuning. Finally, we train a competitive LLM family, *Celerity*, using these insights, highlighting collapse as an effective tool for developing efficient LLMs.
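
As a rough illustration of what "collapse" and "deviation-from-collapse" could look like in practice, the sketch below normalizes two completed loss curves and measures the largest gap between them. The abstract does not spell out the normalization, so this assumes the simplest one: fraction of training on the x-axis and loss divided by final loss on the y-axis. All function names and the synthetic curves are hypothetical and are not taken from the paper.

```python
from bisect import bisect_left


def normalize_curve(steps, losses):
    """Map a loss curve to (fraction of training, loss / final loss).

    Assumed normalization for illustration only; the paper may use another.
    """
    total, final = steps[-1], losses[-1]
    return [s / total for s in steps], [l / final for l in losses]


def interp(xs, ys, x):
    """Linearly interpolate (xs, ys) at x; xs must be strictly increasing."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_left(xs, x)
    x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def collapse_deviation(curve_a, curve_b, grid):
    """Largest absolute gap between two normalized curves on a shared grid."""
    xa, ya = normalize_curve(*curve_a)
    xb, yb = normalize_curve(*curve_b)
    return max(abs(interp(xa, ya, g) - interp(xb, yb, g)) for g in grid)


def synthetic_curve(total_steps, final_loss, n_points=11):
    """Toy curve with the same normalized shape at every scale (hypothetical)."""
    steps = [total_steps * (i + 1) / n_points for i in range(n_points)]
    losses = [final_loss * (1.0 + 0.5 * (1.0 - s / total_steps)) for s in steps]
    return steps, losses


grid = [0.1 * k for k in range(1, 11)]
small = synthetic_curve(1000, 3.0)   # small model, short run
large = synthetic_curve(4000, 2.4)   # larger model, longer run

# Two healthy runs: the normalized curves coincide.
dev_healthy = collapse_deviation(small, large, grid)

# Inject a loss bump mid-run into the large model to mimic a training pathology.
bad_steps, bad_losses = large
bad_losses = [
    l + (0.3 if 0.4 < s / 4000 < 0.6 else 0.0)
    for s, l in zip(bad_steps, bad_losses)
]
dev_bad = collapse_deviation(small, (bad_steps, bad_losses), grid)
```

Here `dev_healthy` is essentially zero while `dev_bad` is clearly larger, which is the kind of signal the diagnostic relies on. For a run still in progress the final loss is unknown, so a real diagnostic would need a predicted final loss in its place; this sketch sidesteps that by comparing finished curves.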

