LG, AI | Jan 30

SPICE: Submodular Penalized Information-Conflict Selection for Efficient Large Language Model Training

Powei Chang, Jinpeng Zhang, Bowen Chen, Chenyu Wang, Chenlu Guo, Yixing Zhang, Yukang Gao, JianXiang Xiang, Yue Gao, Chaoqun Sun, Yiyi Chen, Dongying Kong

arXiv:2601.23155v1 | 1.41 citations | h-index: 5

Originality: Highly original

AI Analysis

This work addresses the high computational cost of training large language models by offering a more efficient data selection method for instruction tuning. It is incremental, however, in that it builds on existing information-based selection approaches.

The paper tackles inefficient data selection for large language model training by identifying gradient conflicts as a key bottleneck. It proposes SPICE, a conflict-aware selector that uses only 10% of the data to match or exceed full-data tuning across 8 benchmarks with LLaMA2-7B and Qwen2-7B.

Information-based data selection for instruction tuning is compelling: maximizing the log-determinant of the Fisher information yields a monotone submodular objective, enabling greedy algorithms to achieve a $(1-1/e)$ approximation under a cardinality budget. In practice, however, we find that alleviating gradient conflicts, i.e., misalignment between per-sample gradients, is a key factor in slowing the decay of marginal log-determinant information gains, thereby preventing significant loss of information. We formalize this via an $\varepsilon$-decomposition that quantifies the deviation from ideal submodularity as a function of conflict statistics, yielding data-dependent approximation factors that tighten as conflicts diminish. Guided by this analysis, we propose SPICE, a conflict-aware selector that maximizes information while penalizing misalignment, and that supports early stopping and proxy models for efficiency. Empirically, SPICE selects subsets with higher log-determinant information than the original criteria, and these informational gains translate into performance improvements: across 8 benchmarks with LLaMA2-7B and Qwen2-7B, SPICE uses only 10% of the data yet matches or exceeds 6 methods, including full-data tuning, thus achieving performance improvements at substantially lower training cost.
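To make the greedy log-determinant idea concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes the Fisher information of a subset $S$ is approximated by $A_S = \lambda I + \sum_{j \in S} g_j g_j^\top$ over per-sample gradients $g_j$, so the marginal gain of adding sample $i$ is $\log(1 + g_i^\top A_S^{-1} g_i)$. The conflict penalty used here (mean negative cosine similarity to the already selected gradients) is an illustrative assumption, since the abstract does not specify SPICE's exact penalty. The names `greedy_logdet_select`, `logdet_information`, `conflict_weight`, and `ridge` are hypothetical.

```python
import numpy as np


def greedy_logdet_select(grads, budget, conflict_weight=0.0, ridge=1.0):
    """Greedily select row indices of `grads` (shape n x d).

    Score of candidate i given the selected set S (illustrative only):
        log(1 + g_i^T A_S^{-1} g_i) - conflict_weight * mean_conflict_i
    where A_S = ridge * I + sum_{j in S} g_j g_j^T, and mean_conflict_i is
    the average of max(0, -cos(g_i, g_j)) over j in S. The penalty form is
    an assumption, not the formula from the paper.
    """
    grads = np.asarray(grads, dtype=float)
    n, d = grads.shape
    a_inv = np.eye(d) / ridge  # inverse of A_S for the empty set
    unit = grads / (np.linalg.norm(grads, axis=1, keepdims=True) + 1e-12)
    conflict_sum = np.zeros(n)  # running sum of negative cosines to S
    available = np.ones(n, dtype=bool)
    selected = []

    for _ in range(min(budget, n)):
        # Marginal log-det gain for every candidate (matrix determinant lemma).
        quad = np.einsum("ij,jk,ik->i", grads, a_inv, grads)
        info_gain = np.log1p(quad)
        penalty = conflict_sum / max(len(selected), 1)
        score = np.where(available, info_gain - conflict_weight * penalty, -np.inf)

        best = int(np.argmax(score))
        selected.append(best)
        available[best] = False

        # Sherman-Morrison rank-one update of A_S^{-1}.
        g = grads[best]
        ag = a_inv @ g
        a_inv = a_inv - np.outer(ag, ag) / (1.0 + g @ ag)

        # Accumulate how strongly each candidate opposes the new pick.
        conflict_sum += np.maximum(0.0, -(unit @ unit[best]))

    return selected


def logdet_information(grads, subset, ridge=1.0):
    """Return log det(ridge * I + sum of g g^T over `subset`)."""
    grads = np.asarray(grads, dtype=float)
    g = grads[list(subset)]
    _, value = np.linalg.slogdet(ridge * np.eye(grads.shape[1]) + g.T @ g)
    return value
```

With `conflict_weight=0.0` this reduces to plain greedy log-determinant maximization, the setting in which the $(1-1/e)$ guarantee for monotone submodular objectives applies. A positive `conflict_weight` trades some information gain for better gradient alignment, which is the qualitative behavior the abstract describes. In practice `grads` would likely be low-dimensional projections of per-sample gradients, possibly from a proxy model, rather than full parameter-space gradients.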
