CL · AI · Mar 15

Drift and selection in LLM text ecosystems

arXiv:2604.08554 · 28.9 · h-index: 2

AI Analysis

This paper addresses the recursive degradation of training data for AI systems, with implications for corpus design. The contribution is incremental in that it builds on existing n-gram models.

The paper asks how AI-generated text entering the public record affects future learning cycles. It develops an exactly solvable mathematical framework showing that unfiltered reuse progressively removes rare forms and that publication which merely reflects the statistical status quo drives the corpus to a shallow state. Normative filtering, by contrast, preserves deeper structure, and for that case the paper establishes an optimal upper bound on the divergence from shallow equilibria.

The public text record -- the material from which both people and AI systems now learn -- is increasingly shaped by its own outputs. Generated text enters the public record, later agents learn from it, and the cycle repeats. Here we develop an exactly solvable mathematical framework for this recursive process, based on variable-order $n$-gram agents, and separate two forces acting on the public corpus. The first is drift: unfiltered reuse progressively removes rare forms, and in the infinite-corpus limit we characterise the stable distributions exactly. The second is selection: publication, ranking and verification filter what enters the record, and the outcome depends on what is selected. When publication merely reflects the statistical status quo, the corpus converges to a shallow state in which further lookahead brings no benefit. When publication is normative -- rewarding quality, correctness or novelty -- deeper structure persists, and we establish an optimal upper bound on the resulting divergence from shallow equilibria. The framework therefore identifies when recursive publication compresses public text and when selective filtering sustains richer structure, with implications for the design of AI training corpora.
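The drift mechanism described in the abstract can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the paper's model: it uses a finite corpus and a plain unigram (order-0) agent rather than variable-order $n$-gram agents in the infinite-corpus limit, and the function names, the Zipf-like starting weights, and all parameter values are hypothetical choices made for illustration. Each generation fits a maximum-likelihood unigram model to the current corpus and publishes a fresh corpus sampled from it with no filtering.

```python
import random
from collections import Counter


def resample_generation(corpus, size, rng):
    """One cycle: fit a unigram MLE to `corpus`, then sample `size` new tokens from it."""
    counts = Counter(corpus)
    types = list(counts)
    weights = [counts[t] for t in types]
    # Tokens absent from `corpus` have zero estimated probability and can never return.
    return rng.choices(types, weights=weights, k=size)


def simulate_drift(vocab_size=200, corpus_size=1000, generations=50, seed=0):
    """Track the number of distinct token types across unfiltered reuse cycles.

    All defaults are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    types = list(range(vocab_size))
    # Zipf-like starting weights: a few common forms and a long tail of rare ones.
    zipf_weights = [1.0 / (rank + 1) for rank in range(vocab_size)]
    corpus = rng.choices(types, weights=zipf_weights, k=corpus_size)
    support = [len(set(corpus))]
    for _ in range(generations):
        corpus = resample_generation(corpus, corpus_size, rng)
        support.append(len(set(corpus)))
    return support


support_sizes = simulate_drift()
print("distinct types, first vs last generation:", support_sizes[0], support_sizes[-1])
```

In this toy setting the count of distinct types can only stay flat or fall from one generation to the next, because a form that goes unsampled once receives zero probability afterwards. It does not model selection; a filtering step would have to be added between fitting and publishing.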
