CL · Mar 3, 2025

SampleMix: A Sample-wise Pre-training Data Mixing Strategy by Coordinating Data Quality and Diversity

arXiv:2503.01506v16 citations · h-index: 26 · EMNLP
Originality: Incremental advance
AI Analysis

This paper addresses inefficient pretraining data mixing for large language models and offers a novel approach to improving model performance. The advance is incremental because it builds on existing data mixing methods.

The paper tackles suboptimal data distribution in pretraining large language models by proposing SampleMix, a sample-wise data mixing strategy that coordinates data quality and diversity. SampleMix surpasses existing domain-based methods across multiple downstream tasks and perplexity assessments, and it reaches the baselines' performance with 1.4x to 2.1x fewer training steps.

Existing pretraining data mixing methods for large language models (LLMs) typically follow a domain-wise methodology, a top-down process that first determines domain weights and then performs uniform data sampling across each domain. However, these approaches neglect significant inter-domain overlaps and commonalities, failing to control the global diversity of the constructed training dataset. Further, uniform sampling within domains ignores fine-grained sample-specific features, potentially leading to suboptimal data distribution. To address these shortcomings, we propose a novel sample-wise data mixture approach based on a bottom-up paradigm. This method performs global cross-domain sampling by systematically evaluating the quality and diversity of each sample, thereby dynamically determining the optimal domain distribution. Comprehensive experiments across multiple downstream tasks and perplexity assessments demonstrate that SampleMix surpasses existing domain-based methods. Meanwhile, SampleMix requires 1.4x to 2.1x fewer training steps to achieve the baselines' performance, highlighting the substantial potential of SampleMix to optimize pre-training data.
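
The bottom-up idea in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: the per-sample `quality` and `diversity` scores, the blending parameter `alpha`, and the function names `sample_weights` and `build_mixture` are all hypothetical, and the abstract does not say how the scores are computed or combined. The sketch only shows how sampling globally by per-sample weight lets the domain distribution emerge from the draw, rather than being fixed in advance.

```python
import random
from collections import Counter

# Toy corpus. The scores are made-up values in [0, 1]; how real scores
# would be obtained is outside the scope of this sketch.
corpus = [
    {"id": 0, "domain": "web", "quality": 0.9, "diversity": 0.8},
    {"id": 1, "domain": "web", "quality": 0.2, "diversity": 0.1},
    {"id": 2, "domain": "code", "quality": 0.7, "diversity": 0.9},
    {"id": 3, "domain": "books", "quality": 0.6, "diversity": 0.3},
]


def sample_weights(samples, alpha=0.5):
    """Blend diversity and quality linearly (an assumed combination rule)
    and normalize so the weights sum to 1."""
    raw = [alpha * s["diversity"] + (1 - alpha) * s["quality"] for s in samples]
    total = sum(raw)
    return [r / total for r in raw]


def build_mixture(samples, budget, alpha=0.5, seed=0):
    """Draw `budget` samples across all domains at once, then report the
    domain shares that result from the draw."""
    rng = random.Random(seed)
    weights = sample_weights(samples, alpha)
    chosen = rng.choices(samples, weights=weights, k=budget)
    counts = Counter(s["domain"] for s in chosen)
    shares = {domain: n / budget for domain, n in counts.items()}
    return chosen, shares


if __name__ == "__main__":
    _, shares = build_mixture(corpus, budget=1000)
    print(shares)
```

A domain-wise method would instead fix the domain shares first and sample uniformly inside each domain, so the two web samples above would be equally likely despite their very different scores.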
