AI · Sep 8, 2025

Accelerate Scaling of LLM Finetuning via Quantifying the Coverage and Depth of Instruction Set

Chengwei Wu, Li Du, Hanyu Zhao, Yiming Ju, Jiapu Wang, Tianyu Chen, Haoyi Zhou

arXiv:2509.06463v2 · 3.3 · h-index: 4

Originality Highly original

AI Analysis

This work addresses the challenge of optimizing data selection for efficient LLM fine-tuning, offering a practical solution for researchers and practitioners, though it is incremental in building on existing data selection concepts.

The paper tackles the problem of inefficient scaling in LLM fine-tuning by identifying semantic coverage and information depth as key dataset properties, and proposes the Information Landscape Approximation (ILA) framework for data selection, which achieves faster and more sustained performance improvements across tasks and model sizes compared to existing methods.

Scaling the amount of data used for supervised fine-tuning (SFT) does not guarantee proportional gains in model performance, highlighting a critical need to understand what makes training samples effective. This work identifies two fundamental dataset properties that govern SFT scalability: semantic coverage, or the breadth of task domains, and information depth, or the richness of individual examples. We demonstrate that simple proxies for these properties explain the majority of validation loss variance in our experiments. We further propose the Information Landscape Approximation (ILA), a model-agnostic data selection framework that jointly optimizes for these two factors. ILA constructs compact subsets that approximate the informational value of large datasets. Empirical results show that models tuned on ILA-selected data achieve faster and more sustained performance improvements across diverse tasks and model sizes compared to existing methods, a phenomenon we term accelerated scaling.
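The abstract does not say how ILA scores or selects samples, so the following is only an illustrative sketch of what jointly optimizing coverage and depth could look like, not the authors' algorithm. It assumes each sample already has an embedding vector and a scalar depth proxy. The function name `select_subset`, the weight `alpha`, and the greedy farthest-point heuristic are all hypothetical choices made for this example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_subset(embeddings, depth_scores, k, alpha=0.5):
    """Greedily pick k sample indices, trading off coverage and depth.

    Hypothetical heuristic, NOT the ILA procedure from the paper:
      coverage = distance to the nearest already-selected sample
                 (farthest-point style), scaled to [0, 1]
      depth    = min-max normalised depth proxy, scaled to [0, 1]
      score    = alpha * coverage + (1 - alpha) * depth
    """
    n = len(embeddings)
    if len(depth_scores) != n:
        raise ValueError("embeddings and depth_scores must have equal length")
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and the number of samples")
    if k == 0:
        return []

    # Min-max normalise the depth proxy so it is comparable to coverage.
    lo, hi = min(depth_scores), max(depth_scores)
    span = hi - lo
    depth = [(d - lo) / span if span > 0 else 0.0 for d in depth_scores]

    # Seed with the deepest sample (coverage is undefined for an empty set).
    first = max(range(n), key=lambda i: depth[i])
    selected = [first]
    remaining = set(range(n)) - {first}
    min_dist = [euclidean(e, embeddings[first]) for e in embeddings]

    while len(selected) < k:
        # Scale coverage by the largest remaining distance.
        max_d = max(min_dist[i] for i in remaining)

        def score(i):
            coverage = min_dist[i] / max_d if max_d > 0 else 0.0
            return alpha * coverage + (1 - alpha) * depth[i]

        # Iterate in index order so ties resolve deterministically.
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
        # Update each sample's distance to its nearest selected neighbour.
        for i in remaining:
            min_dist[i] = min(min_dist[i], euclidean(embeddings[i], embeddings[best]))

    return selected
```

With two tight clusters, for example embeddings `[[0, 0], [0.1, 0], [5, 5], [5.1, 5]]` and depth scores `[1.0, 0.9, 0.2, 0.8]`, calling `select_subset(..., k=2)` with the default `alpha` picks one sample from each cluster, while `alpha=0.0` ignores coverage and returns the two deepest samples from the same cluster.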
