A Bitter Lesson for Data Filtering
Challenges the common belief that data filtering is essential for large model pretraining, suggesting that sufficient compute can compensate for low data quality.
The study investigates data filtering for large model pretraining and finds that, with sufficient compute, the best data filter is no filter: sufficiently trained large models benefit from low-quality and distractor data.
We investigate data filtering for large model pretraining via new scaling studies that target the high-compute, data-scarce regime. Despite an apparently common belief that filtering data to include only high-quality information is essential, our experiments suggest that with enough compute, the best data filter is no data filter. We find that sufficiently trained large-parameter models not only tolerate low-quality and distractor data, but in fact benefit from nominally "poor" data.