AI · Mar 15

Data Darwinism Part II: DataEvolve -- AI can Autonomously Evolve Pretraining Data Curation

arXiv:2603.14420 · 98.3 · h-index: 7
Predicted impact top 4% in AI · last 90 days · Originality: Highly original
AI Analysis

This work addresses the prohibitive cost of manually designing curation strategies for large-scale, heterogeneous pretraining data, making data curation for AI model training more efficient and effective.

The paper replaces manual design of data curation strategies for large-scale pretraining corpora with DataEvolve, a framework that evolves strategies autonomously through iterative optimization. The result is Darwin-CC, a 504B-token dataset. 3B models trained on Darwin-CC outperform those trained on raw data by 3.96 points and reach a 44.13 average score across 18 benchmarks, surpassing existing datasets such as DCLM and Ultra-FineWeb.

Data Darwinism (Part I) established a ten-level hierarchy for data processing, showing that stronger processing can unlock greater data value. However, that work relied on manually designed strategies for a single category. Modern pretraining corpora comprise hundreds of heterogeneous categories spanning domains and content types, each demanding specialized treatment. At this scale, manual strategy design becomes prohibitive. This raises a key question: can strategies evolve in an automated way? We introduce DataEvolve, a framework that enables strategies to evolve through iterative optimization rather than manual design. For each data category, DataEvolve operates in a closed evolutionary loop: it identifies quality issues, generates candidate strategies, executes them on sampled data, evaluates results, and refines approaches across generations. The process accumulates knowledge through an experience pool of discovered issues and a strategy pool tracking performance across iterations. Applied to 8 categories spanning 672B tokens from Nemotron-CC, DataEvolve produces Darwin-CC, a 504B-token dataset with strategies evolved through 30 iterations per category. Training 3B models on 500B tokens, Darwin-CC outperforms raw data (+3.96 points) and achieves a 44.13 average score across 18 benchmarks, surpassing DCLM, Ultra-FineWeb, and FineWeb-Edu, with strong gains on knowledge-intensive tasks such as MMLU. Analysis shows evolved strategies converge on cleaning-focused approaches: targeted noise removal and format normalization with domain-aware preservation, echoing the L4 (Generative Refinement) principles from Part I. Ablation studies confirm iterative evolution is essential: optimized strategies outperform suboptimal ones by 2.93 points, establishing evolutionary strategy design as feasible and necessary for pretraining-scale data curation.
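The closed loop described in the abstract (identify issues, generate candidate strategies, execute on sampled data, evaluate, refine) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the operation names, the issue detectors, the `quality` score, the greedy insert-one-operation search, and the tiny corpus are hypothetical stand-ins, not the paper's actual strategies, evaluators, or search procedure. Only the overall shape (an experience pool of discovered issues, a strategy pool of scored candidates, 30 iterations) follows the description above.

```python
import random
import re
from dataclasses import dataclass, field

# Hypothetical cleaning operations (illustrative stand-ins, not the paper's strategies).
OPS = {
    "strip_html": lambda t: re.sub(r"<[^>]+>", "", t),
    "drop_cookie_banner": lambda t: t.replace("Accept cookies", ""),
    "collapse_whitespace": lambda t: " ".join(t.split()),
}

# Hypothetical issue detectors, each paired with the operation that targets it.
ISSUES = {
    "html_tags": (lambda t: re.search(r"<[^>]+>", t) is not None, "strip_html"),
    "cookie_banner": (lambda t: "Accept cookies" in t, "drop_cookie_banner"),
    "extra_whitespace": (lambda t: t != " ".join(t.split()), "collapse_whitespace"),
}


@dataclass
class Pools:
    experience: set = field(default_factory=set)    # issues discovered so far
    strategies: list = field(default_factory=list)  # (iteration, strategy, score)


def run(strategy, docs):
    """Execute a strategy (an ordered tuple of operation names) on documents."""
    out = []
    for text in docs:
        for op in strategy:
            text = OPS[op](text)
        out.append(text)
    return out


def find_issues(docs):
    """Return the names of all issues detected in at least one document."""
    return {name for name, (detect, _) in ISSUES.items() if any(detect(d) for d in docs)}


def quality(docs):
    """Toy score in [0, 1]: share of (document, issue) checks that pass."""
    failures = sum(detect(d) for d in docs for detect, _ in ISSUES.values())
    return 1 - failures / (len(docs) * len(ISSUES))


def evolve(corpus, iterations=30, sample_size=4, seed=0):
    rng = random.Random(seed)
    pools = Pools()
    best = ()  # start from the empty strategy, i.e. raw data
    for it in range(iterations):
        sample = rng.sample(corpus, min(sample_size, len(corpus)))
        cleaned = run(best, sample)
        best_score = quality(cleaned)
        # 1. Identify issues that survive the current best strategy.
        pools.experience |= find_issues(cleaned)
        # 2. Generate candidates: insert a targeted fix at every position.
        candidates = [
            best[:pos] + (ISSUES[issue][1],) + best[pos:]
            for issue in sorted(pools.experience)
            for pos in range(len(best) + 1)
        ]
        # 3-4. Execute and evaluate each candidate on the same sample.
        for cand in candidates:
            score = quality(run(cand, sample))
            pools.strategies.append((it, cand, score))
            # 5. Refine: keep a candidate only if it strictly improves the score.
            if score > best_score:
                best, best_score = cand, score
    return best, pools


corpus = [
    "<p>Photosynthesis converts light   into energy.</p>",
    "Accept cookies  The mitochondrion produces ATP.",
    "<div>Accept cookies</div> Water boils at 100 C at sea level.",
    "A clean sentence with no noise.",
]
best, pools = evolve(corpus)
print("evolved strategy:", best)
print("quality on corpus:", quality(run(best, corpus)))
```

In this toy setting the order of operations matters: removing tags or banners can leave stray whitespace behind, so whitespace collapsing is only fully effective as the final step. Because candidates are generated by inserting an operation at every position, the search can discover such orderings instead of having them hard-coded.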

