Recursive Training Loops in LLMs: How training data properties modulate distribution shift in generated data?
This work addresses model collapse in LLMs for AI researchers and practitioners, providing empirical insights into how data properties modulate distribution shifts. It is incremental in that it builds on prior studies of recursive training effects.
The study investigates how properties of human training data, such as lexical and semantic diversity, influence the magnitude of distribution shifts in LLMs during recursive training loops. It finds that lexical diversity amplifies shifts, while semantic diversity and data quality mitigate them. Shifts also vary across internet domains, and human data properties affect whether political bias is amplified or reduced.
Large language models (LLMs) are increasingly used in the creation of online content, creating feedback loops as subsequent generations of models will be trained on this synthetic data. Such loops were shown to lead to distribution shifts, in which models misrepresent the true underlying distributions of human data (also called model collapse). However, how human data properties affect such shifts remains poorly understood. In this paper, we provide the first empirical examination of the effect of such properties on the outcome of recursive training. We first confirm that using different human datasets leads to distribution shifts of different magnitudes. Through exhaustive manipulation of dataset properties combined with regression analyses, we then identify a set of properties predicting distribution shift magnitudes. Lexical diversity is found to amplify these shifts, while semantic diversity and data quality mitigate them. Furthermore, we find that these influences are highly modular: data scraped from a given internet domain has little influence on the content generated for another domain. Finally, experiments on political bias reveal that human data properties affect whether the initial bias will be amplified or reduced. Overall, our results portray a novel view, where different parts of the internet may undergo different types of distribution shift.
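To make the notion of lexical diversity concrete, the sketch below computes a type-token ratio (unique tokens divided by total tokens) over a small set of texts. This is an illustrative assumption: the abstract does not state which lexical diversity measure the study uses, and the function name `type_token_ratio`, the regex tokenizer, and the toy sentences are all hypothetical.

```python
import re


def type_token_ratio(texts):
    """Hypothetical lexical-diversity proxy: unique tokens / total tokens.

    Assumption: a simple lowercase word tokenizer; the measure used in the
    study itself is not specified in the abstract.
    """
    tokens = [tok for text in texts for tok in re.findall(r"\w+", text.lower())]
    if not tokens:
        return 0.0  # define the ratio as 0 for an empty dataset
    return len(set(tokens)) / len(tokens)


# Toy inputs for illustration only; they are not data from the study.
repetitive = ["the cat sat", "the cat sat", "the cat sat"]
varied = ["the cat sat", "a dog barked", "birds fly south"]

print(type_token_ratio(repetitive))  # 3 unique tokens out of 9
print(type_token_ratio(varied))      # 9 unique tokens out of 9
```

A score of this kind, computed per dataset, is the sort of property that could serve as a predictor in a regression on distribution shift magnitude. The type-token ratio is sensitive to dataset size, so comparisons would only be meaningful between datasets of similar length.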