LG · AI · May 19

A Bitter Lesson for Data Filtering

arXiv:2605.19407 · 93.4
AI Analysis

Challenges the common belief that data filtering is essential for large model pretraining, suggesting that sufficient compute can compensate for low data quality.

The study investigates data filtering for large model pretraining and finds that with sufficient compute, the best data filter is no filter, as large models benefit from low-quality and distractor data.

We investigate data filtering for large model pretraining via new scaling studies that target the high compute, data-scarce regime. In spite of an apparently common belief that filtering data to include only high-quality information is essential, our experiments suggest that with enough compute, the best data filter is no data filter. We find that sufficiently trained large parameter models not only tolerate low-quality and distractor data, but in fact benefit from nominally "poor" data.

