CL · May 4

InfoLaw: Information Scaling Laws for Large Language Models with Quality-Weighted Mixture Data and Repetition

Fengze Liu, Weidong Zhou, Binbin Liu, Ping Guo, Zijun Wang, Bingni Zhang, Yifan Zhang, Yifeng Yu, Xiaohuan Zhou, Taifeng Wang

arXiv:2605.02364 · 89.8

Predicted impact: top 53% in CL · last 90 days

Originality: Incremental advance

AI Analysis

For LLM practitioners, InfoLaw enables reliable extrapolation of data-recipe performance across compute budgets and overtraining levels, which supports choosing a data recipe at scale.

InfoLaw introduces a data-aware scaling framework that predicts LLM loss from consumed tokens, model size, data mixture weights, and repetition. On unseen recipes at scales up to 7B parameters and 425B tokens, it reaches 0.15% mean and 0.96% max absolute error in loss prediction.
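To make the two reported figures concrete, here is a minimal sketch of how a mean and a max error over a set of held-out runs could be computed. It assumes the percentages are absolute errors relative to the observed loss, which this summary does not state; the function name `loss_error_summary` and all numbers are hypothetical and not taken from the paper.

```python
def loss_error_summary(predicted, observed):
    """Return (mean, max) absolute error in percent of the observed loss.

    Assumption: errors are measured relative to the observed loss.
    """
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    errors = [abs(p - o) / o * 100.0 for p, o in zip(predicted, observed)]
    return sum(errors) / len(errors), max(errors)


# Hypothetical held-out runs: observed final loss vs. the fitted prediction.
observed = [2.0, 2.5, 4.0]
predicted = [2.01, 2.5, 3.98]

mean_err, max_err = loss_error_summary(predicted, observed)
print(f"mean {mean_err:.2f}%  max {max_err:.2f}%")
```

The max is worth reporting next to the mean because a low average can hide a single recipe on which the prediction is far off.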

Upweighting high-quality data in LLM pretraining often improves performance, but in data-limited regimes, especially under overtraining, stronger upweighting increases repetition and can degrade performance. Standard scaling laws, however, do not reliably extrapolate across mixture recipes or under repetition, which leaves the selection of optimal data recipes at scale underdetermined. To address this, we introduce InfoLaw (Information Scaling Laws), a data-aware scaling framework that predicts loss from consumed tokens, model size, data mixture weights, and repetition. The key idea is to model pretraining as information accumulation, where quality controls information density and repetition induces scale-dependent diminishing returns. We first collect model performance after training on datasets that vary in scale, quality distribution, and repetition level. We then build a model of information such that the accumulated information accurately predicts the observed performance. InfoLaw predicts performance on unseen data recipes and larger-scale runs (up to 7B, 425B tokens) with 0.15% mean and 0.96% max absolute error in loss, and it extrapolates reliably across overtraining levels, enabling efficient data-recipe selection under varying compute budgets.
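The link between upweighting and repetition in the abstract is plain arithmetic: a source that receives weight w out of a token budget D, but only holds N unique tokens, is seen w·D/N times on average. The sketch below illustrates only that bookkeeping and is not InfoLaw's loss model, whose functional form the abstract does not give. The pool names, pool sizes, and mixture weights are hypothetical; only the 425B-token budget echoes the abstract.

```python
def epochs_per_source(total_tokens, weights, pool_sizes):
    """Average number of passes over each source for a given mixture.

    total_tokens: tokens consumed during training (D)
    weights:      mixture weight per source, summing to 1 (w_i)
    pool_sizes:   unique tokens available per source (N_i)
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return {name: weights[name] * total_tokens / pool_sizes[name]
            for name in weights}


# Hypothetical quality buckets with made-up unique-token counts.
pools = {"high_quality": 50e9, "medium": 200e9, "low": 800e9}
budget = 425e9  # token budget, matching the largest run in the abstract

mild = epochs_per_source(
    budget, {"high_quality": 0.2, "medium": 0.3, "low": 0.5}, pools)
strong = epochs_per_source(
    budget, {"high_quality": 0.6, "medium": 0.3, "low": 0.1}, pools)

for name in pools:
    print(f"{name:12s} mild={mild[name]:.2f} epochs  strong={strong[name]:.2f} epochs")
```

With these made-up numbers, tripling the high-quality weight triples the passes over that pool (about 1.7 to about 5.1), which is the regime where the abstract says diminishing returns from repetition start to matter.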
