CL · Nov 16, 2023

The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text

arXiv:2311.09807v2 · 119 citations · h-index: 57
Originality: Incremental advance
AI Analysis

This paper addresses the problem of preserving linguistic richness in language models and highlights the risks of recursive training on synthetic text. It is rated an incremental advance because it builds on existing concerns about synthetic data.

This study investigated the impact of training language models on synthetic data generated by predecessors, focusing on linguistic diversity rather than performance metrics, and found a consistent decrease in diversity across successive iterations, especially in creative tasks.

This study investigates the consequences of training language models on synthetic data generated by their predecessors, an increasingly prevalent practice given the prominence of powerful generative models. Diverging from the usual emphasis on performance metrics, we focus on the impact of this training methodology on linguistic diversity, especially when conducted recursively over time. To assess this, we adapt and develop a set of novel metrics targeting lexical, syntactic, and semantic diversity, applying them in recursive finetuning experiments across various natural language generation tasks in English. Our findings reveal a consistent decrease in the diversity of the model outputs through successive iterations, especially remarkable for tasks demanding high levels of creativity. This trend underscores the potential risks of training language models on synthetic text, particularly concerning the preservation of linguistic richness. Our study highlights the need for careful consideration of the long-term effects of such training approaches on the linguistic capabilities of language models.
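As a rough illustration of what a lexical diversity measurement can look like, the sketch below computes a distinct-n ratio (unique n-grams divided by total n-grams) over a set of model outputs. This is a common, generic measure and is not claimed to be one of the paper's metrics. The function name `distinct_n`, the whitespace tokenization, and the toy `gen0` and `gen3` samples are all assumptions made up for this example, not data from the study.

```python
from collections import Counter


def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams across a list of texts.

    Hypothetical helper for illustration; uses naive lowercase
    whitespace tokenization rather than a real tokenizer.
    """
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        # Build n-grams by zipping the token list against shifted copies.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    total = sum(counts.values())
    return len(counts) / total if total else 0.0


# Toy outputs invented for this sketch: an early generation with varied
# wording, and a later generation whose outputs repeat each other.
gen0 = ["the cat sat on the mat", "a dog ran across the wide field"]
gen3 = ["the cat sat on the mat", "the cat sat on the rug"]

print(distinct_n(gen0))  # 1.0 (every bigram is unique)
print(distinct_n(gen3))  # 0.6 (6 unique bigrams out of 10)
```

A falling distinct-n score across successive generations would be one simple signal of the kind of lexical narrowing the abstract describes. Syntactic and semantic diversity would need different tools, such as parse-based or embedding-based comparisons.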

Code Implementations: 1 repo
