An Information-Theoretic Criterion for Efficient Data Synthesis

arXiv:2605.16379

AI Analysis

For researchers and practitioners working with synthetic data for LLM training, this work offers a theoretical framework to predict when synthetic data will be beneficial or harmful.

The paper provides an information-theoretic explanation for why synthetic data improves LLMs only when the generation-training loop is information-open (shaped by external signals), and shows that information-closed loops lead to collapse. It proposes that learning converges to the most information-efficient signal component, explaining both efficiency and reward hacking.

Synthetic data has become crucial for large language model training, but its effectiveness is highly inconsistent. We provide an information-theoretic account of this inconsistency: synthetic data improves a model only when the generation-training loop is information-open, i.e., shaped by external signals (verifiers, environments, or rubrics) that inject task-relevant information beyond the model's current distribution. When the loop is information-closed (relying on the model's own outputs without such signals), the data processing inequality ensures that task-relevant information can only decrease, making collapse a predicted outcome. Among information-open pipelines, both efficiency and generalization hinge on the meta-level of supervision: a coarser signal such as binary correctness treats all acceptable outputs as equivalent, so the behavior it teaches is not tied to any particular domain or surface form and generalizes naturally across tasks and domains. These observations lead to a guiding thesis: learning preferentially converges to the most information-efficient signal component available, which accelerates learning when that component is the intended one, but causes reward hacking when a spurious pattern happens to be simpler.
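The data processing inequality argument in the abstract can be checked numerically on a toy example. The sketch below is an illustration only, not the paper's experimental setup: the variable names (T for the task, M for the model's output, S for the synthetic data, V for a verifier bit), the alphabet sizes, and the randomly drawn distributions are all assumptions made here. In the closed loop, S is generated from M alone, so T -> M -> S is a Markov chain and I(T; S) cannot exceed I(T; M). In the open loop, a hypothetical binary verifier that sees T attaches a correctness bit to M, and the information about T carried by the pair (M, V) is at least I(T; M).

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in bits) of a 2-D joint probability table."""
    joint = np.asarray(joint, dtype=float)
    p_row = joint.sum(axis=1, keepdims=True)
    p_col = joint.sum(axis=0, keepdims=True)
    mask = joint > 0  # skip zero cells, where p * log(p) contributes 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (p_row @ p_col)[mask])))

rng = np.random.default_rng(0)

# Assumed toy setup: 3 task values, 4 possible model outputs.
p_t = np.array([0.5, 0.3, 0.2])                        # P(T)
model_given_task = rng.dirichlet(np.ones(4), size=3)   # P(M | T), one row per task
joint_tm = p_t[:, None] * model_given_task             # P(T, M)

# Closed loop: S is sampled from M only, through an arbitrary channel P(S | M).
data_given_model = rng.dirichlet(np.ones(4), size=4)   # P(S | M), one row per M
joint_ts = joint_tm @ data_given_model                 # P(T, S) = sum_m P(T, m) P(S | m)

# Open loop: a hypothetical verifier emits V = 1 exactly when M equals T.
# The training signal is the pair (M, V), flattened to index 2 * m + v.
joint_open = np.zeros((3, 8))
for t in range(3):
    for m in range(4):
        v = int(m == t)
        joint_open[t, 2 * m + v] = joint_tm[t, m]

i_model = mutual_information(joint_tm)
i_closed = mutual_information(joint_ts)
i_open = mutual_information(joint_open)

print(f"I(T; M)      = {i_model:.4f} bits")
print(f"closed loop  = {i_closed:.4f} bits")
print(f"open loop    = {i_open:.4f} bits")
```

Under these assumptions the closed-loop value never exceeds I(T; M), whatever channel is drawn for P(S | M), while the verifier bit can only add information about T. Note that the verifier here is deliberately coarse (one bit per sample), which mirrors the binary-correctness signal the abstract mentions.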
