LG · AI · CL · May 7, 2025

When Bad Data Leads to Good Models

Harvard
arXiv:2505.04741v1 · 9 citations · h-index: 10 · ICML
Originality: Highly original
AI Analysis

This work addresses the challenge of balancing toxicity reduction and model performance in LLM development, offering a novel co-design approach that could impact AI safety practices.

The paper re-examines data quality in LLM pre-training, showing that pre-training on more toxic data can improve post-training control, reducing output toxicity while preserving general capabilities. In experiments with Olmo-1B models trained on varying ratios of clean and toxic data, the authors find that toxic data raises the base model's toxicity but also makes that toxicity easier to remove. The result is a better trade-off when detoxification techniques such as inference-time intervention (ITI) are applied, as evaluated on the Toxigen and Real Toxicity Prompts benchmarks.

In large language model (LLM) pretraining, data quality is believed to determine model quality. In this paper, we re-examine the notion of "quality" from the perspective of pre- and post-training co-design. Specifically, we explore the possibility that pre-training on more toxic data can lead to better control in post-training, ultimately decreasing a model's output toxicity. First, we use a toy experiment to study how data composition affects the geometry of features in the representation space. Next, through controlled experiments with Olmo-1B models trained on varying ratios of clean and toxic data, we find that the concept of toxicity enjoys a less entangled linear representation as the proportion of toxic data increases. Furthermore, we show that although toxic data increases the generational toxicity of the base model, it also makes the toxicity easier to remove. Evaluations on Toxigen and Real Toxicity Prompts demonstrate that models trained on toxic data achieve a better trade-off between reducing generational toxicity and preserving general capabilities when detoxifying techniques such as inference-time intervention (ITI) are applied. Our findings suggest that, with post-training taken into account, bad data may lead to good models.
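
The abstract's central idea, that a concept with a clean linear representation is easier to steer at inference time, can be illustrated with a minimal sketch. This is not the authors' implementation and not the full ITI procedure: it assumes synthetic "activations" in which toxicity lies along a single axis, estimates the direction as a difference of class means, and shifts activations against it. The names `toxicity_direction`, `intervene`, and `make_synthetic`, and the values of `alpha` and `separation`, are hypothetical and chosen only for illustration.

```python
import numpy as np

def toxicity_direction(acts, labels):
    """Unit vector pointing from the non-toxic mean to the toxic mean."""
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    diff = acts[labels].mean(axis=0) - acts[~labels].mean(axis=0)
    return diff / np.linalg.norm(diff)

def intervene(acts, direction, alpha):
    """Shift every activation by alpha against the toxicity direction."""
    return np.asarray(acts, dtype=float) - alpha * direction

def make_synthetic(n, dim, separation, seed=0):
    """Synthetic activations (assumption): toxicity is encoded along axis 0."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n).astype(bool)
    acts = rng.normal(size=(n, dim))
    acts[:, 0] += separation * labels
    return acts, labels

acts, labels = make_synthetic(n=2000, dim=16, separation=4.0)
direction = toxicity_direction(acts, labels)
steered = intervene(acts, direction, alpha=4.0)

# Mean projection of toxic examples onto the direction, before and after steering.
before = (acts[labels] @ direction).mean()
after = (steered[labels] @ direction).mean()
print(f"toxic projection before: {before:.2f}, after: {after:.2f}")
```

Because the direction has unit length, the intervention lowers the mean projection of the toxic examples by exactly `alpha`. In a real model the direction would be estimated from hidden states rather than synthetic vectors, and how well that works depends on how entangled the toxicity feature is with other features, which is the property the paper studies.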
