LG | CL | Sep 10, 2025

Generative Data Refinement: Just Ask for Better Data

arXiv:2509.08653v2 | 7 citations | h-index: 3
Originality: Incremental advance
AI Analysis

The paper addresses data scarcity and quality issues in training frontier AI models and offers a practical way to incorporate risky user-generated content, though it is an incremental improvement on existing data refinement methods.

The paper tackles projected data exhaustion for large models by introducing Generative Data Refinement (GDR), a framework that uses pretrained generative models to transform datasets containing undesirable content into refined ones. The authors report that GDR outperforms industry-grade anonymization solutions and enables detoxification of unsafe datasets.

For a fixed parameter size, the capabilities of large models are primarily determined by the quality and quantity of their training data. Consequently, training datasets now grow faster than the rate at which new data is indexed on the web, leading to projected data exhaustion over the next decade. Much more data exists as user-generated content that is not publicly indexed, but incorporating such data comes with considerable risks, such as leaking private information and other undesirable content. We introduce a framework, Generative Data Refinement (GDR), for using pretrained generative models to transform a dataset with undesirable content into a refined dataset that is more suitable for training. Our experiments show that GDR can outperform industry-grade solutions for dataset anonymization, as well as enable direct detoxification of highly unsafe datasets. Moreover, we show that by generating synthetic data that is conditioned on each example in the real dataset, GDR's refined outputs naturally match the diversity of web-scale datasets, and thereby avoid the often challenging task of generating diverse synthetic data via model prompting. The simplicity and effectiveness of GDR make it a powerful tool for scaling up the total stock of training data for frontier models.
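
The per-example conditioning described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the prompt template, the `refine_dataset` helper, and the `toy_generate` function are all hypothetical names introduced here, and `toy_generate` is a regex-based stand-in for a pretrained generative model so that the example runs without any model access. The only point it shows is the shape of the loop: each refined output is produced from exactly one real example, so the refined dataset has one rewritten counterpart for every real example.

```python
import re
from typing import Callable, Iterable, List

# Hypothetical prompt template (an assumption; the paper's prompts are not shown here).
PROMPT_TEMPLATE = (
    "Rewrite the following text so that it contains no personal "
    "information, changing as little else as possible:\n\n{example}"
)


def refine_dataset(examples: Iterable[str], generate: Callable[[str], str]) -> List[str]:
    """Call `generate` once per example, so each output is conditioned on one input."""
    return [generate(PROMPT_TEMPLATE.format(example=ex)) for ex in examples]


# Simple email pattern used only by the toy stand-in below.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def toy_generate(prompt: str) -> str:
    """Stand-in for a pretrained model (assumption): it only masks email addresses."""
    # Recover the example text that follows the instruction part of the prompt.
    text = prompt.split("\n\n", 1)[1]
    return EMAIL.sub("<EMAIL>", text)


raw = [
    "Contact alice@example.com for the logs.",
    "The build passed on Tuesday.",
]
refined = refine_dataset(raw, toy_generate)
```

In practice `generate` would wrap a call to whatever pretrained model is available; passing it in as a callable keeps the loop independent of any particular model API.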
