ML LG ST ME · Sep 26, 2025

Preventing Model Collapse Under Overparametrization: Optimal Mixing Ratios for Interpolation Learning and Ridge Regression

Anvit Garg, Sohom Bhattacharya, Pragya Sur

arXiv:2509.22341v1 · 5 citations · h-index: 2

Originality: Incremental advance

AI Analysis

This paper addresses model degradation under iterative training, a problem relevant to machine learning researchers. It offers theoretical insights, but the advance is incremental: it extends known results to broader settings.

The paper tackles model collapse in overparameterized linear regression by deriving optimal mixing ratios of real and synthetic data to minimize long-term prediction error, showing that the optimal real-data proportion converges to the reciprocal of the golden ratio for interpolation and is at least one-half for ridge regression.

Model collapse occurs when generative models degrade after repeatedly training on their own synthetic outputs. We study this effect in overparameterized linear regression in a setting where each iteration mixes fresh real labels with synthetic labels drawn from the model fitted in the previous iteration. We derive precise generalization error formulae for minimum-$\ell_2$-norm interpolation and ridge regression under this iterative scheme. Our analysis reveals intriguing properties of the optimal mixing weight that minimizes long-term prediction error and provably prevents model collapse. For instance, in the case of min-$\ell_2$-norm interpolation, we establish that the optimal real-data proportion converges to the reciprocal of the golden ratio for fairly general classes of covariate distributions. Previously, this property was known only for ordinary least squares in low dimensions. For ridge regression, we further analyze two popular model classes -- the random-effects model and the spiked covariance model -- demonstrating how spectral geometry governs optimal weighting. In both cases, as well as for isotropic features, we find that the optimal mixing ratio should be at least one-half, reflecting the necessity of favoring real data over synthetic data. We validate our theoretical results with extensive simulations.
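
To make the iterative scheme in the abstract concrete, here is a minimal simulation sketch. It is not the authors' code, and several choices are assumptions made only for illustration: labels are mixed as `w * real + (1 - w) * synthetic`, covariates are isotropic Gaussian and redrawn at every iteration, and synthetic labels carry the same noise level as real ones. The function names and parameters (`simulate_mixing`, `min_norm_interpolator`, `w`, `sigma`) are hypothetical. The only quantity taken directly from the abstract is the reciprocal of the golden ratio, $(\sqrt{5}-1)/2 \approx 0.618$.

```python
import numpy as np

# Reciprocal of the golden ratio: the limiting optimal real-data proportion
# stated in the abstract for min-l2-norm interpolation.
INV_GOLDEN_RATIO = (np.sqrt(5.0) - 1.0) / 2.0


def min_norm_interpolator(X, y):
    """Minimum-l2-norm solution of X b = y, computed via the pseudoinverse."""
    return np.linalg.pinv(X) @ y


def simulate_mixing(w, n=50, p=200, iterations=20, sigma=0.5, seed=0):
    """Return the squared estimation error after each fit.

    ASSUMPTION (not from the paper): labels are mixed as
    w * real + (1 - w) * synthetic, with fresh isotropic Gaussian
    covariates and the same noise level sigma for both label types.
    For isotropic covariates the squared estimation error equals the
    excess prediction risk.
    """
    rng = np.random.default_rng(seed)
    beta = rng.standard_normal(p)
    beta /= np.linalg.norm(beta)  # true signal with unit norm

    # Iteration 0: fit on real labels only (p > n, so the fit interpolates).
    X = rng.standard_normal((n, p))
    beta_hat = min_norm_interpolator(X, X @ beta + sigma * rng.standard_normal(n))
    errors = [float(np.sum((beta_hat - beta) ** 2))]

    for _ in range(iterations):
        X = rng.standard_normal((n, p))
        y_real = X @ beta + sigma * rng.standard_normal(n)
        # Synthetic labels are drawn from the previously fitted model.
        y_synth = X @ beta_hat + sigma * rng.standard_normal(n)
        y_mixed = w * y_real + (1.0 - w) * y_synth
        beta_hat = min_norm_interpolator(X, y_mixed)
        errors.append(float(np.sum((beta_hat - beta) ** 2)))
    return errors


if __name__ == "__main__":
    for w in (0.3, INV_GOLDEN_RATIO, 1.0):
        print(f"w = {w:.3f}: final error = {simulate_mixing(w)[-1]:.4f}")
```

A single run at small `n` and `p` is noisy and should not be expected to reproduce the asymptotic optimum. Averaging over many seeds and sweeping `w` over a grid would be the natural next step for comparing against the stated limit.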
