CL AI | Jan 20, 2025

Optimizing Pretraining Data Mixtures with LLM-Estimated Utility

William Held, Bhargavi Paranjape, Punit Singh Koura, Mike Lewis, Frank Zhang, Todor Mihaylov

Georgia Tech

arXiv:2501.11747v2 | 17.614 citations | h-index: 24

Originality: Incremental advance

AI Analysis

This work addresses the challenge of efficiently selecting training data for LLMs, offering incremental improvements in automation and compute efficiency for researchers and practitioners.

The paper tackles the problem of optimizing pretraining data mixtures for large language models, which requires balancing quality, quantity, and diversity across sources. It proposes two methods: UtiliMax, which achieves up to a 10.6x speedup over manual baselines, and MEDU, which matches ablation-based performance while reducing computational requirements by ~200x.

Large Language Models improve with increasing amounts of high-quality training data. However, leveraging larger datasets requires balancing quality, quantity, and diversity across sources. After evaluating nine baseline methods under both compute- and data-constrained scenarios, we find token-count heuristics outperform manual and learned mixes, indicating that simple approaches accounting for dataset size and diversity are surprisingly effective. Building on this insight, we propose two complementary approaches: UtiliMax, which extends token-based heuristics by incorporating utility estimates from reduced-scale ablations, achieving up to a 10.6x speedup over manual baselines; and Model Estimated Data Utility (MEDU), which leverages LLMs to estimate data utility from small samples, matching ablation-based performance while reducing computational requirements by $\sim$200x. Together, these approaches establish a new framework for automated, compute-efficient data mixing that is robust across training regimes.
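As a rough illustration of the idea of combining token counts with per-source utility estimates, the sketch below computes sampling weights from dataset sizes and then rescales them by a utility score. It is not the paper's UtiliMax procedure, whose exact optimization is not described in this abstract; the function names, the temperature parameter, the multiplicative rescaling, and the example sources and numbers are all assumptions made for illustration.

```python
def token_count_mixture(token_counts, temperature=1.0):
    """Sampling weights proportional to token_count ** (1 / temperature).

    temperature=1.0 samples in proportion to size; larger values
    flatten the mixture toward uniform. (Illustrative heuristic only.)
    """
    scaled = {
        name: count ** (1.0 / temperature)
        for name, count in token_counts.items()
    }
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}


def utility_weighted_mixture(token_counts, utilities, temperature=1.0):
    """Rescale the token-count weights by a per-source utility estimate.

    The multiplicative rule here is an assumption for illustration,
    not the optimization used in the paper.
    """
    base = token_count_mixture(token_counts, temperature)
    reweighted = {name: base[name] * utilities[name] for name in base}
    total = sum(reweighted.values())
    if total <= 0:
        raise ValueError("utilities must give at least one positive weight")
    return {name: value / total for name, value in reweighted.items()}


# Hypothetical sources and numbers, chosen only to make the arithmetic easy.
counts = {"web": 300, "code": 100}
utils = {"web": 1.0, "code": 3.0}

size_only = token_count_mixture(counts)
with_utility = utility_weighted_mixture(counts, utils)
```

In this toy setup the size-only mixture favors the larger source, while the utility estimate shifts weight toward the smaller but higher-utility source.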

