CLOct 1, 2025
Characterizing Model Behavior Under Synthetic Data Training: An Empirical Study Across Scales and Mixing RatiosY. Du, G. Wu, G. Tang et al.
Synthetic data generated by large language models has become integral to modern NLP training pipelines, from bootstrapping reasoning capabilities to augmenting instruction-following datasets. While recent work demonstrates successful applications maintaining high external data ratios, systematic understanding of how synthetic data proportion affects model behavior across different scales remains limited. This paper presents a controlled empirical study examining model performance, calibration, and output characteristics when trained on varying synthetic-to-external data ratios. Using the Pythia model suite (410M-12B parameters) across five diverse tasks, we evaluate models after one to three training iterations with synthetic data proportions ranging from 0-50\%. Our key findings include: models maintain stable performance with up to 20\% synthetic data, but degradation accelerates beyond 30\%; larger models (6.9B-12B) show greater robustness to synthetic data than smaller models (410M-1.4B); calibration degradation precedes accuracy loss, providing an early warning signal; and task characteristics matter, with reasoning tasks degrading faster than retrieval tasks under synthetic data training. Importantly, we find that current best practices, such as those employed in STaR and Self-Instruct systems that maintain greater than 80\% external data, operate well within safe regimes identified by our experiments. We provide practical guidance for practitioners on synthetic data budgets based on model scale and task requirements, alongside detailed comparison with concurrent work including Shumailov et al.'s model collapse findings.
LGJun 8, 2020
The Strength of Nesterov's Extrapolation in the Individual Convergence of Nonsmooth OptimizationW. Tao, Z. Pan, G. Wu et al.
The extrapolation strategy raised by Nesterov, which can accelerate the convergence rate of gradient descent methods by orders of magnitude when dealing with smooth convex objective, has led to tremendous success in training machine learning tasks. In this article, the convergence of individual iterates of projected subgradient (PSG) methods for nonsmooth convex optimization problems is theoretically studied based on Nesterov's extrapolation, which we name individual convergence. We prove that Nesterov's extrapolation has the strength to make the individual convergence of PSG optimal for nonsmooth problems. In light of this consideration, a direct modification of the subgradient evaluation suffices to achieve optimal individual convergence for strongly convex problems, which can be regarded as making an interesting step toward the open question about stochastic gradient descent (SGD) posed by Shamir. Furthermore, we give an extension of the derived algorithms to solve regularized learning tasks with nonsmooth losses in stochastic settings. Compared with other state-of-the-art nonsmooth methods, the derived algorithms can serve as an alternative to the basic SGD especially in coping with machine learning problems, where an individual output is needed to guarantee the regularization structure while keeping an optimal rate of convergence. Typically, our method is applicable as an efficient tool for solving large-scale $l$1-regularized hinge-loss learning problems. Several comparison experiments demonstrate that our individual output not only achieves an optimal convergence rate but also guarantees better sparsity than the averaged solution.