BiMix: A Bivariate Data Mixing Law for Language Model Pretraining
This work addresses data mixing for LLM pretraining, offering theoretical insights and practical tools to improve training efficiency. The contribution is incremental in that it builds on existing scaling laws.
The paper studies how to understand and optimize pretraining data composition for large language models. It introduces BiMix, a bivariate data mixing law that jointly models domain proportions and data volume. BiMix extrapolates loss with high accuracy (mean relative error < 0.2%), and optimizing domain proportions with it yields better model performance than existing methods.
Large language models have demonstrated remarkable capabilities across various tasks, primarily attributed to the utilization of diversely sourced data. However, the impact of pretraining data composition on model performance remains poorly understood. This paper introduces $\textbf{BiMix}$, a novel bivariate data mixing law that models the joint scaling behavior of domain proportions and data volume in LLM pretraining. $\textbf{BiMix}$ provides a systematic framework for understanding and optimizing data mixtures across diverse domains. Through extensive experiments on two large-scale datasets, we demonstrate $\textbf{BiMix}$'s high accuracy in loss extrapolation (mean relative error < 0.2%) and its generalization to unseen mixtures ($R^{2}$ > 0.97). Optimization of domain proportions yields superior model performance compared to existing methods. Furthermore, we establish entropy-based measures as efficient proxies for data mixing, offering a computationally lightweight strategy. Our work contributes both theoretical insights into data mixing dynamics and practical tools for enhancing LLM training efficiency, paving the way for more effective scaling strategies in language model development.
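The two fit-quality figures quoted in the abstract, mean relative error and $R^{2}$, can be illustrated with a minimal Python sketch. It uses the standard textbook definitions of both metrics; the paper's exact evaluation protocol is not given here, and the function names and loss values below are hypothetical, chosen only for illustration.

```python
def mean_relative_error(observed, predicted):
    """Average of |predicted - observed| / |observed| over all points."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - o) / abs(o) for o, p in zip(observed, predicted))
    return total / len(observed)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical validation losses (not taken from the paper):
# `observed` stands in for measured losses, `predicted` for the
# values a fitted mixing law would extrapolate.
observed = [2.50, 2.40, 2.30, 2.20]
predicted = [2.505, 2.397, 2.302, 2.199]

mre = mean_relative_error(observed, predicted)
r2 = r_squared(observed, predicted)
print(f"mean relative error: {mre:.4%}")
print(f"R^2: {r2:.4f}")
```

A mean relative error below 0.2% corresponds to `mre < 0.002` in this sketch, and the generalization claim corresponds to `r2 > 0.97`.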