LG · May 28

The Long-Term Effects of Data Selection in LLM Fine-Tuning

Yuxin Yang, Aoxiong Zeng, Xiangquan Yang

arXiv:2605.30537 · 80.3 · h-index: 7

AI Analysis

This work addresses a problem that matters to practitioners and researchers doing multi-stage LLM fine-tuning: short-sighted data selection can harm long-term model performance and adaptability.

This paper investigates the long-term effects of data selection strategies in multi-stage LLM fine-tuning and finds that strategies optimized for immediate performance can hinder future adaptability. The authors formalize this as 'myopic selection' and propose a Long-Horizon Aware Selection (LHAS) objective to mitigate it.

Data selection is increasingly used to reduce the cost of large language model (LLM) fine-tuning, with recent methods prioritizing samples by current utility, diversity, quality, or influence. This paper studies a different question: when fine-tuning occurs over multiple stages, can selection strategies that look optimal now make the model less adaptable later? We introduce a long-horizon view of LLM data selection in which a selector is evaluated not only by immediate task performance, but also by future adaptation speed, forgetting, capability imbalance, and out-of-distribution robustness. We compare representative random, loss-based, gradient-based, diversity-based, quality-based, and utility-diversity selection families under a unified multi-stage protocol. Through controlled experiments designed to instantiate this protocol, we show how short-term selectors can exhibit rank reversal: they improve the current stage while slowing subsequent learning and increasing forgetting. We formalize this behavior as 'myopic selection', provide a simple local analysis of why it can occur, and propose a diagnostic Long-Horizon Aware Selection (LHAS) objective that augments immediate utility with coverage, future-proxy transfer, and anti-concentration terms. The study argues that data selection should be evaluated as a training intervention that shapes the model's learning trajectory, rather than only as a local data-efficiency mechanism.
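
The abstract names the four ingredients of the LHAS objective but not its exact form, so the following is only a rough illustrative sketch and not the authors' formulation. It assumes a simple additive score (immediate utility plus weighted coverage and future-proxy terms, minus a concentration penalty) and selects samples greedily. The function name `select_long_horizon`, the weights `w_cov`, `w_fut`, `w_conc`, the cosine-based coverage term, and the per-group share used as the concentration penalty are all hypothetical choices made for illustration.

```python
import numpy as np

def select_long_horizon(utility, embeddings, future_proxy, groups, k,
                        w_cov=1.0, w_fut=1.0, w_conc=1.0):
    """Greedy selection under an assumed additive long-horizon score.

    Hypothetical score for candidate i given the current selection S:
        utility[i]
        + w_cov  * (1 - max cosine similarity of i to S)   # coverage
        + w_fut  * future_proxy[i]                          # future-proxy transfer
        - w_conc * (share of S already in i's group)        # anti-concentration
    Embeddings are assumed to be non-zero vectors.
    """
    utility = np.asarray(utility, dtype=float)
    future_proxy = np.asarray(future_proxy, dtype=float)
    groups = list(groups)
    emb = np.asarray(embeddings, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize rows

    n = len(utility)
    available = np.ones(n, dtype=bool)
    max_sim = np.zeros(n)   # similarity to the closest already-selected sample
    group_counts = {}       # how many selected samples each group has
    selected = []

    for _ in range(min(k, n)):
        coverage = 1.0 - max_sim
        share = np.array([group_counts.get(g, 0) for g in groups], dtype=float)
        share /= max(len(selected), 1)
        score = utility + w_cov * coverage + w_fut * future_proxy - w_conc * share
        score[~available] = -np.inf  # never pick the same sample twice

        i = int(np.argmax(score))
        selected.append(i)
        available[i] = False
        group_counts[groups[i]] = group_counts.get(groups[i], 0) + 1
        max_sim = np.maximum(max_sim, emb @ emb[i])

    return selected
```

Setting all three weights to zero reduces this sketch to plain top-k selection by immediate utility, which makes it convenient for contrasting a purely short-term selector with one that also rewards coverage and penalizes concentration on the same candidate pool.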
