LG | AI | Feb 3

UniGeM: Unifying Data Mixing and Selection via Geometric Exploration and Mining

arXiv:2602.03772v1 | h-index: 2
Originality: Highly original
AI Analysis

UniGeM targets the data-curation bottleneck in LLM training. The approach is novel rather than incremental, and it reports concrete gains in data efficiency and downstream performance.

The paper addresses data quality as a limit on Large Language Model scaling. It introduces UniGeM, a framework that unifies data mixing and selection via geometric exploration and mining. UniGeM achieves 2.0× data efficiency over a random baseline and improves performance on reasoning-heavy and multilingual evaluations.

The scaling of Large Language Models (LLMs) is increasingly limited by data quality. Most methods handle data mixing and sample selection separately, which can break the structure in code corpora. We introduce UniGeM, a framework that unifies mixing and selection by treating data curation as a manifold approximation problem without training proxy models or relying on external reference datasets. UniGeM operates hierarchically: Macro-Exploration learns mixing weights with stability-based clustering; Micro-Mining filters high-quality instances by their geometric distribution to ensure logical consistency. Validated by training 8B and 16B MoE models on 100B tokens, UniGeM achieves 2.0× data efficiency over a random baseline and further improves overall performance compared to SOTA methods in reasoning-heavy evaluations and multilingual generalization.
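
The abstract gives only the outline of the two stages, so the following is a purely illustrative sketch of the general idea, not UniGeM's actual algorithm. It assumes samples are already embedded as points, uses plain k-means with Jaccard overlap across random seeds as a stand-in for "stability-based clustering", weights each cluster by size times stability, and keeps the samples closest to each centroid as a stand-in for geometric filtering. All function names, the weighting rule, and the `keep_frac` parameter are hypothetical.

```python
import math
import random


def kmeans(points, k, seed, iters=20):
    """Plain Lloyd's k-means on coordinate tuples; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids


def jaccard(a, b):
    """Jaccard overlap of two index sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def macro_explore(points, k, seeds):
    """Hypothetical macro stage: mixing weights from cluster size and stability."""
    labels, centroids = kmeans(points, k, seeds[0])
    ref_sets = [{i for i, lab in enumerate(labels) if lab == c} for c in range(k)]
    scores = [0.0] * k
    for seed in seeds[1:]:
        alt_labels, _ = kmeans(points, k, seed)
        alt_sets = [{i for i, lab in enumerate(alt_labels) if lab == c}
                    for c in range(k)]
        # A cluster is stable if some cluster in the rerun closely matches it.
        for c, ref in enumerate(ref_sets):
            scores[c] += max(jaccard(ref, alt) for alt in alt_sets)
    reruns = max(len(seeds) - 1, 1)
    raw = [len(ref_sets[c]) * scores[c] / reruns for c in range(k)]
    total = sum(raw)
    weights = [r / total for r in raw] if total > 0 else [1.0 / k] * k
    return labels, centroids, weights


def micro_mine(points, labels, centroids, keep_frac):
    """Hypothetical micro stage: keep the points nearest each cluster centroid."""
    kept = []
    for c, centroid in enumerate(centroids):
        members = [i for i, lab in enumerate(labels) if lab == c]
        members.sort(key=lambda i: math.dist(points[i], centroid))
        kept.extend(members[:math.ceil(keep_frac * len(members))])
    return sorted(kept)


# Toy data: two well-separated 2-D blobs standing in for sample embeddings.
rng = random.Random(0)
points = ([(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20)]
          + [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(20)])
labels, centroids, weights = macro_explore(points, k=2, seeds=[1, 2, 3])
selected = micro_mine(points, labels, centroids, keep_frac=0.5)
```

In this toy setup, `weights` would set how much of the training budget each cluster receives, and `selected` lists the sample indices retained within the clusters. A real pipeline would need learned embeddings and the paper's own stability and filtering criteria, which the abstract does not specify.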

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
