LGAIFeb 19, 2024

Towards Theoretical Understandings of Self-Consuming Generative Models

arXiv:2402.11778v226 citationsh-index: 7ICML
Originality Incremental advance
AI Analysis

This provides theoretical insights for researchers and practitioners in machine learning addressing the challenge of data degradation in iterative generative model training, though it is incremental as it builds on existing concerns about synthetic data quality.

This paper tackles the problem of training generative models in a self-consuming loop, where models are recursively trained on mixtures of real and synthetic data, by deriving theoretical bounds on the total variation distance between synthetic and real data distributions, showing it can be controlled with sufficient real data and revealing a phase transition where the distance initially increases then decreases beyond a threshold.

This paper tackles the emerging challenge of training generative models within a self-consuming loop, wherein successive generations of models are recursively trained on mixtures of real and synthetic data from previous generations. We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models, including parametric and non-parametric models. Specifically, we derive bounds on the total variation (TV) distance between the synthetic data distributions produced by future models and the original real data distribution under various mixed training scenarios for diffusion models with a one-hidden-layer neural network score function. Our analysis demonstrates that this distance can be effectively controlled under the condition that mixed training dataset sizes or proportions of real data are large enough. Interestingly, we further unveil a phase transition induced by expanding synthetic data amounts, proving theoretically that while the TV distance exhibits an initial ascent, it declines beyond a threshold point. Finally, we present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes