LG · AI · CL · Feb 10, 2024

A Tale of Tails: Model Collapse as a Change of Scaling Laws

Peking U
arXiv:2402.07043v2 · 131 citations · h-index: 22 · ICML
Originality: Incremental advance
AI Analysis

This paper addresses model collapse, a growing concern for AI developers and researchers as synthetic data becomes prevalent, and makes incremental theoretical and experimental contributions.

The paper investigates how scaling laws change when synthetic data enters the training corpus. It identifies decay phenomena such as loss of scaling and the un-learning of skills, and validates them in experiments with a transformer on an arithmetic task and with text generation using Llama2.

As AI model size grows, neural scaling laws have become a crucial tool to predict the improvements of large models when increasing capacity and the size of original (human or natural) training data. Yet, the widespread use of popular models means that the ecosystem of online data and text will co-evolve to progressively contain increased amounts of synthesized data. In this paper we ask: How will the scaling laws change in the inevitable regime where synthetic data makes its way into the training corpus? Will future models still improve, or be doomed to degenerate up to total (model) collapse? We develop a theoretical framework of model collapse through the lens of scaling laws. We discover a wide range of decay phenomena, analyzing loss of scaling, shifted scaling with number of generations, the "un-learning" of skills, and grokking when mixing human and synthesized data. Our theory is validated by large-scale experiments with a transformer on an arithmetic task and text generation using the large language model Llama2.
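The "loss of scaling" idea can be made concrete with a toy calculation. The sketch below is an illustration under assumptions that do not appear in the text above. It assumes a Zipf-distributed set of items, a learner that is correct only on items it has seen during training, and "synthetic" data modelled as the original distribution with its tail cut off. The function names, the support size, the exponent and the cutoff are all hypothetical choices, and nothing here reproduces the paper's experiments.

```python
def zipf_probs(n_items, beta):
    """Normalized Zipf law p_i proportional to i^(-beta) on a finite support."""
    weights = [i ** (-beta) for i in range(1, n_items + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_test_error(p_true, n_samples, cutoff=None):
    """Expected probability mass of items never seen in n_samples training draws.

    Assumption: the learner is right exactly on the items it has seen.
    If cutoff is given, training data is drawn from p_true truncated to its
    top-`cutoff` items and renormalized (a stand-in for synthetic data that
    has lost the tail); the error is still measured against p_true.
    """
    k = len(p_true) if cutoff is None else cutoff
    head_mass = sum(p_true[:k])
    error = sum(p_true[k:])  # tail items can never appear in training
    for p in p_true[:k]:
        q = p / head_mass  # training probability of this item
        error += p * (1.0 - q) ** n_samples  # chance the item stays unseen
    return error


if __name__ == "__main__":
    p = zipf_probs(n_items=10_000, beta=2.0)
    for t in (10, 100, 1_000, 10_000, 100_000):
        clean = expected_test_error(p, t)
        cut = expected_test_error(p, t, cutoff=50)
        print(f"T={t:>6}  original={clean:.4f}  tail-cut={cut:.4f}")
```

In this toy setting the error on original data keeps falling as the sample size T grows, while the error on tail-cut data flattens at the probability mass of the missing tail. That plateau is one simple way to picture a scaling law that stops scaling.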

