MergeMix: Optimizing Mid-Training Data Mixtures via Learnable Model Merging
MergeMix gives researchers and practitioners an automated, scalable way to optimize data mixtures for LLMs. The contribution is incremental in that it builds on existing model merging techniques.
The paper addresses the cost of optimizing data mixtures for large language models, which is computationally expensive with existing methods. It introduces MergeMix, a method that uses model merging weights as a low-cost proxy for mixture performance. On 8B and 16B parameter models, MergeMix achieves performance comparable to or better than manual tuning while reducing search costs, and it shows high rank consistency (Spearman ρ > 0.9).
Optimizing data mixtures is essential for unlocking the full potential of large language models (LLMs), yet identifying the optimal composition remains computationally prohibitive due to reliance on heuristic trials or expensive proxy training. To address this, we introduce \textbf{MergeMix}, a novel approach that efficiently determines optimal data mixing ratios by repurposing model merging weights as a high-fidelity, low-cost performance proxy. By training domain-specific experts on minimal tokens and optimizing their merging weights against downstream benchmarks, MergeMix effectively optimizes the performance of data mixtures without incurring the cost of full-scale training. Extensive experiments on models with 8B and 16B parameters validate that MergeMix achieves performance comparable to or surpassing exhaustive manual tuning while drastically reducing search costs. Furthermore, MergeMix exhibits high rank consistency (Spearman $\rho > 0.9$) and strong cross-scale transferability, offering a scalable, automated solution for data mixture optimization.
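The core idea in the abstract (merge domain experts with learnable weights, tune those weights against a downstream objective, then read the weights as mixing ratios) can be illustrated with a toy sketch. This is not the authors' implementation. Everything in it is an assumption made for illustration: experts are flat NumPy parameter vectors, merging is a linear combination of expert deltas around a base model, the downstream benchmark is replaced by a hypothetical quadratic proxy loss toward a `target` vector, and the weights are fitted with projected gradient descent on the probability simplex. The function names `project_to_simplex`, `merge`, and `learn_merge_weights` are hypothetical.

```python
import numpy as np


def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1} (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    tau = css[rho] / (rho + 1)
    return np.maximum(v - tau, 0.0)


def merge(base, experts, weights):
    """Linear merge: base plus a weighted sum of expert deltas.

    base: (d,) parameters, experts: (k, d) parameters, weights: (k,).
    """
    return base + weights @ (experts - base)


def learn_merge_weights(base, experts, target, steps=500):
    """Fit merging weights by projected gradient descent.

    ASSUMPTION: the downstream benchmark is stood in for by the toy loss
    0.5 * ||merged - target||^2. A real setup would evaluate the merged
    model on actual benchmarks instead.
    """
    deltas = experts - base
    k = deltas.shape[0]
    weights = np.full(k, 1.0 / k)  # start from a uniform mixture
    # Step size 1/L, where L is the Lipschitz constant of the gradient
    # (largest singular value of the delta matrix, squared).
    lipschitz = np.linalg.norm(deltas, 2) ** 2
    step = 1.0 / lipschitz
    for _ in range(steps):
        merged = base + weights @ deltas
        grad = deltas @ (merged - target)  # gradient of the toy loss w.r.t. weights
        weights = project_to_simplex(weights - step * grad)
    return weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.normal(size=8)
    experts = base + rng.normal(size=(3, 8))  # three hypothetical domain experts
    hidden_mix = np.array([0.6, 0.3, 0.1])    # mixture used to build the toy target
    target = merge(base, experts, hidden_mix)
    learned = learn_merge_weights(base, experts, target)
    print("learned weights, read as mixing ratios:", np.round(learned, 3))
```

Keeping the weights on the simplex makes them non-negative and sum to one, so they can be read directly as candidate data mixing ratios. That reading is the assumption this sketch illustrates; the abstract does not specify how the paper parameterizes or optimizes its merging weights.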