LG | AI | Apr 21, 2025

Compute-Optimal LLMs Provably Generalize Better With Scale

arXiv:2504.15208v1 | 6 citations | h-index: 18 | ICLR
Originality: Incremental advance
AI Analysis

The paper offers AI researchers theoretical insight into scaling laws, though the advance is incremental because it builds on the existing Chinchilla scaling laws.

The paper asks why larger language models generalize better. It develops generalization bounds for compute-optimal LLMs and shows that the generalization gap shrinks as models scale up, because both the loss variance and the quantization error decrease.

Why do larger language models generalize better? To investigate this question, we develop generalization bounds on the pretraining objective of large language models (LLMs) in the compute-optimal regime, as described by the Chinchilla scaling laws. We introduce a novel, fully empirical Freedman-type martingale concentration inequality that tightens existing bounds by accounting for the variance of the loss function. This generalization bound can be decomposed into three interpretable components: the number of parameters per token, the loss variance, and the quantization error at a fixed bitrate. As compute-optimal language models are scaled up, the number of parameters per data point remains constant; however, both the loss variance and the quantization error decrease, implying that larger models should have smaller generalization gaps. We examine why larger models tend to be more quantizable from an information theoretic perspective, showing that the rate at which they can integrate new information grows more slowly than their capacity on the compute-optimal frontier. From these findings we produce a scaling law for the generalization gap, with bounds that become predictably stronger with scale.
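
The abstract's point that parameters per data point stay constant on the compute-optimal frontier can be illustrated with a minimal sketch. It is not taken from the paper: it assumes the commonly cited approximations C ≈ 6ND for training FLOPs and roughly 20 tokens per parameter, and the function and parameter names are hypothetical.

```python
import math


def compute_optimal_allocation(flops, tokens_per_param=20.0, flops_per_param_token=6.0):
    """Split a FLOP budget into (parameters, tokens).

    Assumptions (illustrative rules of thumb, not the paper's fitted values):
      C = flops_per_param_token * N * D   (training compute)
      D = tokens_per_param * N            (compute-optimal data/parameter ratio)
    Solving both for N gives N = sqrt(C / (flops_per_param_token * tokens_per_param)).
    """
    if flops <= 0:
        raise ValueError("flops must be positive")
    n_params = math.sqrt(flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    # The ratio N/D is identical at every budget, so any shrinkage of the
    # bound with scale has to come from the other two components.
    for flops in (1e20, 1e22, 1e24):
        n, d = compute_optimal_allocation(flops)
        print(f"C={flops:.0e}  N={n:.3e}  D={d:.3e}  N/D={n / d:.3f}")
```

Under these assumptions the parameters-per-token term stays fixed as compute grows, which is why the abstract attributes the shrinking generalization gap to the loss variance and the quantization error.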
