LG · May 9

Practical Scaling Laws: Converting Compute into Performance in a Data-Constrained World

arXiv:2605.09189 · 65.2
AI Analysis

For practitioners training models under data constraints, this paper provides a scaling law that accounts for overfitting and multi-epoch training, which Chinchilla's form cannot represent, enabling better compute allocation.

The authors propose a closed-form extension to Chinchilla scaling laws that saturates at an uninformed baseline in data-constrained regimes and decomposes loss into undercapacity, undertraining, and overfitting terms. Validated on four architecture families and refit to five published LLM grids, it achieves state-of-the-art extrapolation RMSE on every LLM grid evaluated and on most cells of the authors' own experiments, and it enables cost-aware compute allocation.

The scaling laws guiding modern model training were calibrated for a single regime: data-rich, single-epoch pretraining. The dominant form, Chinchilla's $L = E + A/N^{\alpha} + B/D^{\beta}$, has three structural limitations outside that regime: it diverges as unique data shrinks instead of saturating at the uninformed baseline; it cannot represent overfitting when capacity exceeds the data; and it conflates total examples seen with unique examples available. We propose a closed-form extension, $L(N, D, T) = E + (L_0 - E)\,h/(1+h)$ with $h = a/N^{\alpha} + b/T^{\beta} + c\,N^{\gamma}/D^{\delta}$, that decomposes loss into undercapacity, undertraining, and overfitting terms. It saturates between the irreducible loss $E$ and an uninformed baseline $L_0$ fixed by the loss type, and reduces to Chinchilla in the data-rich, single-epoch limit. We validate it on four multi-epoch experiments spanning four architecture families (MLPs, ResNets, Fourier neural operators, and transformers) across vision, scientific ML, and language domains, and refit it to five published LLM scaling-law grids. Extrapolating to higher compute and larger unique data than seen at fit time, our form achieves state-of-the-art RMSE on every published LLM grid we evaluate and on most cells of our constructed experiments. Once calibrated, the form admits a cost-aware allocation that recovers Chinchilla's optimum when data is free and shifts toward smaller corpora and more epochs as data grows expensive.
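As a rough illustration of the closed form above, the following Python sketch evaluates $L(N, D, T)$ next to the Chinchilla form. It rests on several assumptions that the abstract does not state: $N$ is read as parameter count, $D$ as unique examples, and $T$ as total examples seen; $L_0$ is taken to be $\ln V$ for cross-entropy over a hypothetical vocabulary of 50,000 tokens; and every coefficient in `PARAMS` is an illustrative placeholder, not a fitted value from the paper. The function names are likewise hypothetical.

```python
import math


def scaling_loss(N, D, T, *, E, L0, a, alpha, b, beta, c, gamma, delta):
    """Evaluate L(N, D, T) = E + (L0 - E) * h / (1 + h) from the abstract."""
    undercapacity = a / N**alpha          # model too small
    undertraining = b / T**beta           # too few examples seen in total
    overfitting = c * N**gamma / D**delta  # capacity exceeds unique data
    h = undercapacity + undertraining + overfitting
    return E + (L0 - E) * h / (1.0 + h)


# Placeholder coefficients for illustration only (assumed, not from the paper).
PARAMS = dict(
    E=1.69, L0=math.log(50_000),  # L0 = ln(V) is an assumption
    a=44.0, alpha=0.34,
    b=45.0, beta=0.28,
    c=1.0, gamma=0.5, delta=1.0,
)


def loss(N, D, T):
    """Proposed form with the placeholder coefficients."""
    return scaling_loss(N, D, T, **PARAMS)


def chinchilla_loss(N, D):
    """Chinchilla form L = E + A/N^alpha + B/D^beta, with A and B chosen as
    (L0 - E) * a and (L0 - E) * b so the two forms agree when h is small."""
    p = PARAMS
    scale = p["L0"] - p["E"]
    return p["E"] + scale * p["a"] / N**p["alpha"] + scale * p["b"] / D**p["beta"]


if __name__ == "__main__":
    # Data-rich, single-epoch (T == D): the two forms nearly coincide.
    print(loss(1e12, 1e15, 1e15), chinchilla_loss(1e12, 1e15))
    # Almost no unique data: the proposed form saturates near L0,
    # while the Chinchilla form grows without bound.
    print(loss(1e9, 1.0, 1.0), chinchilla_loss(1e9, 1.0))
    # Fixed small corpus, 100 epochs: a much larger model does worse.
    print(loss(1e6, 1e6, 1e8), loss(1e12, 1e6, 1e8))
```

Because $h > 0$, the factor $h/(1+h)$ always lies strictly between 0 and 1, which keeps the loss between $E$ and $L_0$. When $h \ll 1$, $h/(1+h) \approx h$; setting $T = D$ and letting the overfitting term vanish then gives the Chinchilla form with $A = (L_0 - E)\,a$ and $B = (L_0 - E)\,b$, which is the identification used in `chinchilla_loss`.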

