LG · ML · Feb 14, 2025

MixMin: Finding Data Mixtures via Convex Minimization

DeepMind · U of Toronto
arXiv:2502.10510v2 · 8 citations · h-index: 30 · ICML
Originality: Incremental advance
AI Analysis

This work addresses the challenge of optimizing data mixtures for ML practitioners, particularly in large-scale training such as language-model pre-training. The advance is incremental because it builds on existing bi-level optimization concepts.

The paper tackles the problem of finding optimal data mixtures for machine learning by formalizing it as a bi-level objective, which becomes convex as the model class grows larger, and by developing MixMin, a gradient-based method for optimizing that convex objective. MixMin improved data mixtures with less than 0.2% additional compute, achieving 1-5% relative improvements in negative log likelihood on language tasks and gains of 0.03-0.15 in average precision on bioassay data.
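
One way to write down the bi-level objective described here is sketched below. The notation is chosen for illustration and is an assumption, not taken from the paper: k is the number of data sources, \Delta_k is the probability simplex over those sources, \mathcal{L}_i is the training loss on source i, \mathcal{L}_{\mathrm{target}} is the downstream loss, and \mathcal{F} is the model class.

```latex
\lambda^\star \in \arg\min_{\lambda \in \Delta_k} \; \mathcal{L}_{\mathrm{target}}\bigl(f_\lambda\bigr),
\qquad
f_\lambda \in \arg\min_{f \in \mathcal{F}} \; \sum_{i=1}^{k} \lambda_i \, \mathcal{L}_i(f)
```

The inner problem trains a model on the mixture with weights \lambda; the outer problem picks the weights whose trained model does best downstream. The difficulty is that each evaluation of the outer objective requires solving the inner training problem.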

Modern machine learning pipelines are increasingly combining and mixing data from diverse and disparate sources, e.g., pre-training large language models. Yet, finding the optimal data mixture is a challenging and open problem. We formalize this data mixing problem as a bi-level objective: the best mixture is the one that would lead to the best model for a downstream objective. Unfortunately, this objective is generally intractable. In this paper, we make the observation that the bi-level data mixing objective becomes convex as our model class becomes larger. We develop and study a gradient-based approach for optimizing this convex objective, which we call MixMin, and test it on language modeling and chemistry tasks. MixMin was the only method that uniformly improved the data mixture in all our experiments. With MixMin, we improved the data mixture using less than 0.2% additional compute for a pythia-410M model trained on 8.2B tokens, resulting in a 1-5% relative improvement in negative log likelihood on PIQA, ARC Easy, SciQ, and OpenWebMath. Crucially, we found that MixMin mixtures for smaller models improved training of larger models, suggesting that MixMin mixtures may be scale-invariant. When mixing bioassay data to train an XGBoost model, we saw improvements in average precision scores of 0.03-0.15.
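
To make the idea of a gradient-based search over a convex mixture objective concrete, here is a minimal sketch. It rests on assumptions the abstract does not state: that the predictions of a model trained on a mixture can be approximated by the weighted mixture of predictions from cheap proxy models trained on each source alone, and that exponentiated-gradient descent is an acceptable optimizer on the simplex. The names `mixture_nll`, `fit_mixture_weights`, `source_probs`, the step size, and the iteration count are all hypothetical.

```python
import numpy as np

def mixture_nll(weights, source_probs):
    """Mean negative log likelihood of the weighted mixture of predictions.

    source_probs[i, j] is the probability that a proxy model trained only on
    source i assigns to the true label of target example j (assumed input).
    """
    mixed = weights @ source_probs            # shape: (n_examples,)
    return -np.mean(np.log(mixed))

def fit_mixture_weights(source_probs, steps=500, lr=0.5):
    """Minimize mixture_nll over the simplex with exponentiated gradient."""
    k = source_probs.shape[0]
    weights = np.full(k, 1.0 / k)             # start from the uniform mixture
    for _ in range(steps):
        mixed = weights @ source_probs
        # Gradient of the mean NLL with respect to each mixture weight.
        grad = -np.mean(source_probs / mixed, axis=1)
        # Multiplicative update; shifting by grad.min() keeps exponents <= 0
        # and does not change the result after renormalization.
        weights = weights * np.exp(-lr * (grad - grad.min()))
        weights /= weights.sum()              # project back onto the simplex
    return weights

# Toy data (illustrative only): sources 0 and 1 are complementary,
# source 2 is uniformly mediocre on every target example.
source_probs = np.array([
    [0.9, 0.8, 0.1, 0.2],
    [0.1, 0.2, 0.9, 0.8],
    [0.3, 0.3, 0.3, 0.3],
])
weights = fit_mixture_weights(source_probs)
print(weights)  # weight concentrates on the two complementary sources
```

Because the objective in this sketch is a negative log of a function linear in the weights, it is convex in the weights, so a simple first-order method suffices. The multiplicative update keeps the weights positive and normalized without a separate projection step.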

