RL-Guided Data Selection for Language Model Finetuning
The paper addresses efficient data selection for fine-tuning LLMs, which matters for resource-constrained applications, though its contribution is an incremental improvement over existing approximate methods.
The paper tackles the problem of selecting an optimal subset of data for fine-tuning large language models under a strict training budget by framing it as a Markov Decision Process and using reinforcement learning to learn selection policies. Across four datasets, training on a 5% subset selected by the method matches or outperforms full-dataset fine-tuning by up to 10.8 accuracy points while cutting wall-clock training time by up to 2 times.
Data selection for finetuning Large Language Models (LLMs) can be framed as a budget-constrained optimization problem: maximizing a model's downstream performance under a strict training data budget. Solving this problem is generally intractable, and existing approximate approaches are pretraining-oriented and transfer poorly to the fine-tuning setting. We reformulate this problem as a tractable Markov Decision Process (MDP) and train agents using various Reinforcement Learning (RL) methods to learn optimal data selection policies, guided by an efficient, proxy-model-based reward signal. Across four datasets, training on a $5\%$ subset selected by our approach matches or outperforms fine-tuning on the full dataset by up to $10.8$ accuracy points, while cutting wall-clock training time by up to $2 \times$, highlighting the promise of RL-guided data selection.
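To make the MDP framing concrete, here is a minimal toy sketch, not the paper's implementation. It assumes an episode picks one item per step until the budget is spent, and that a scalar reward arrives at the end of the episode. The abstract only says "various Reinforcement Learning (RL) methods"; plain REINFORCE with a moving-average baseline is swapped in here for brevity. The names `run_episode`, `proxy_reward`, `train_policy`, and `select_top` are hypothetical, and `proxy_reward` is a stand-in that averages made-up per-item utilities rather than querying a real proxy model.

```python
import math
import random


def proxy_reward(picked, utility):
    # Stand-in for a proxy-model score of the selected subset (assumption:
    # the real reward would come from evaluating a small proxy model).
    return sum(utility[i] for i in picked) / len(picked)


def run_episode(logits, budget, rng):
    # One episode: the state is the set of items picked so far, the action
    # is the next item, and the episode ends when the budget is exhausted.
    available = list(range(len(logits)))
    picked = []
    grads = [0.0] * len(logits)  # accumulated grad of log pi w.r.t. logits
    for _ in range(budget):
        top = max(logits[i] for i in available)
        weights = [math.exp(logits[i] - top) for i in available]
        total = sum(weights)
        action = rng.choices(available, weights=weights)[0]
        # Softmax policy: d log pi(action) / d logit_i = 1[i == action] - p_i
        for i, w in zip(available, weights):
            grads[i] += (1.0 if i == action else 0.0) - w / total
        available.remove(action)
        picked.append(action)
    return picked, grads


def train_policy(utility, budget, episodes=2000, lr=0.5, seed=0):
    # REINFORCE with a moving-average baseline to reduce variance.
    rng = random.Random(seed)
    logits = [0.0] * len(utility)
    baseline = 0.0
    for _ in range(episodes):
        picked, grads = run_episode(logits, budget, rng)
        reward = proxy_reward(picked, utility)
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        logits = [l + lr * advantage * g for l, g in zip(logits, grads)]
    return logits


def select_top(logits, budget):
    # Greedy readout of the learned policy: keep the highest-scoring items.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:budget]


if __name__ == "__main__":
    # Toy pool: 4 useful items among 20; the budget of 4 is arbitrary.
    utility = [1.0] * 4 + [0.0] * 16
    learned = train_policy(utility, budget=4)
    print(select_top(learned, budget=4))
```

In this sketch the reward is computed once per episode on the whole subset, which mirrors the budget-constrained objective (score the subset, not individual items). A real pipeline would replace `proxy_reward` with an actual proxy-model evaluation and would likely act on batches or clusters rather than single examples; both points are assumptions here rather than details given in the abstract.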