UniGeM: Unifying Data Mixing and Selection via Geometric Exploration and Mining
UniGeM addresses the data curation bottleneck in LLM training with a unified approach to data mixing and sample selection, and it reports concrete gains in both data efficiency and downstream performance.
The paper starts from the observation that data quality increasingly limits Large Language Model scaling. It introduces UniGeM, a framework that unifies data mixing and selection via geometric exploration and mining, achieving 2.0× data efficiency over a random baseline and improving performance on reasoning-heavy and multilingual evaluations.
The scaling of Large Language Models (LLMs) is increasingly limited by data quality. Most methods handle data mixing and sample selection separately, which can break the structure present in code corpora. We introduce \textbf{UniGeM}, a framework that unifies mixing and selection by treating data curation as a \textit{manifold approximation} problem, without training proxy models or relying on external reference datasets. UniGeM operates hierarchically: \textbf{Macro-Exploration} learns mixing weights with stability-based clustering, and \textbf{Micro-Mining} selects high-quality instances according to their geometric distribution to ensure logical consistency. In experiments training 8B and 16B MoE models on 100B tokens, UniGeM achieves \textbf{2.0$\times$ data efficiency} over a random baseline and further improves overall performance over SOTA methods on reasoning-heavy evaluations and multilingual generalization.
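To make the two-level structure concrete, the following is a minimal, self-contained sketch of a macro-then-micro curation loop over toy 2-D "embeddings". It is not the UniGeM algorithm: the abstract does not specify the clustering method, the stability criterion, the weighting rule, or the selection score, so everything here is an assumption chosen for illustration. In particular, plain k-means, the Rand index across random seeds as a stability score, size-proportional mixing weights, and nearest-to-centroid selection are all stand-ins, and the names `macro_explore` and `micro_mine` are hypothetical.

```python
import math
import random
from itertools import combinations


def kmeans(points, k, seed, iters=20):
    """Plain Lloyd's k-means (an assumed stand-in); returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]

    def assign():
        return [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]

    for _ in range(iters):
        labels = assign()
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(x) / len(members) for x in zip(*members)]
    return assign(), centroids


def rand_index(a, b):
    """Fraction of point pairs on which two labelings agree (same/different cluster)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def stability(points, k, seeds=(0, 1, 2)):
    """Assumed stability score: mean pairwise Rand index across differently seeded runs."""
    runs = [kmeans(points, k, s)[0] for s in seeds]
    scores = [rand_index(x, y) for x, y in combinations(runs, 2)]
    return sum(scores) / len(scores)


def macro_explore(points, candidate_ks=(2, 3, 4)):
    """Hypothetical macro step: pick the most stable k, then derive mixing weights."""
    best_k = max(candidate_ks, key=lambda k: stability(points, k))
    labels, centroids = kmeans(points, best_k, seed=0)
    # Placeholder rule: weights proportional to cluster size (an assumption).
    weights = [labels.count(c) / len(points) for c in range(best_k)]
    return labels, centroids, weights


def micro_mine(points, labels, centroids, weights, budget):
    """Hypothetical micro step: per cluster, keep the points closest to the centroid."""
    selected = []
    for c, w in enumerate(weights):
        members = [i for i, l in enumerate(labels) if l == c]
        members.sort(key=lambda i: math.dist(points[i], centroids[c]))
        selected.extend(members[: int(w * budget)])  # floor keeps the total within budget
    return selected


# Toy data: three Gaussian blobs standing in for document embeddings.
rng = random.Random(0)
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
          for cx, cy in centers for _ in range(20)]

labels, centroids, weights = macro_explore(points)
selected = micro_mine(points, labels, centroids, weights, budget=30)
print(len(weights), "clusters;", len(selected), "of", len(points), "points kept")
```

The point of the sketch is the division of labor: the macro step decides how much of the budget each region of the embedding space receives, and the micro step decides which instances fill that share. A real pipeline would replace each placeholder with the paper's actual criteria.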