Reconciling Contradictory Views on the Effectiveness of SFT in LLMs: An Interaction Perspective
For LLM practitioners, this work clarifies the mechanism behind SFT's inconsistent effectiveness and offers actionable guidance for training. Its contribution is incremental, however, because it applies existing interaction-based explanations to a known problem.
This paper explains why supervised fine-tuning (SFT) is broadly effective for small-scale networks but inconsistent for large language models (LLMs). It finds that SFT primarily removes noise-like interactions and that this beneficial denoising stage is extremely brief, after which continued fine-tuning leads to overfitting. The findings offer practical guidance for early stopping in LLM training.
This paper explores a scientific question in supervised fine-tuning (SFT): why SFT is broadly effective for small-scale deep neural networks, yet can produce inconsistent or even detrimental effects when applied to large language models (LLMs). Recent advances in interaction-based explanations suggest that interactions between words/tokens provide a faithful metric for quantifying the inference patterns encoded by LLMs. We find that the evolution of interactions during SFT can effectively explain the inconsistent effectiveness of SFT for LLMs. Specifically, we find that (1) SFT primarily removes noise-like interactions while rarely acquiring reliable new interactions, and (2) this denoising stage is extremely brief, after which continued fine-tuning tends to introduce overfitted interactions. We validate these findings across multiple LLMs and datasets. Our findings provide new insights into early stopping and offer practical guidance for LLM training.
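The abstract does not state which interaction measure is used. As a rough illustration only, the sketch below assumes the Harsanyi interaction, a common choice in interaction-based explanation work, defined as I(S) = sum over T ⊆ S of (-1)^(|S|-|T|) v(T), where v(T) is the model output when only the tokens in T are left unmasked. The function names and the toy scoring function `toy_model` are hypothetical and stand in for a real LLM's scalar output. The enumeration is exponential in the number of tokens, so it is only practical for very short inputs.

```python
from itertools import combinations


def subsets(items):
    """Yield every subset of `items` as a frozenset."""
    items = list(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def harsanyi_interactions(v, n):
    """Return {S: I(S)} for every subset S of token indices 0..n-1.

    I(S) = sum over T subset of S of (-1)^(|S| - |T|) * v(T).
    `v` maps a frozenset of unmasked token indices to a scalar output.
    """
    result = {}
    for S in subsets(range(n)):
        result[S] = sum(
            (-1) ** (len(S) - len(T)) * v(T) for T in subsets(S)
        )
    return result


def toy_model(T):
    """Hypothetical scalar output; a stand-in for an LLM's logit or score.

    Token 0 contributes 1.0 on its own; tokens 1 and 2 contribute 2.0
    only when both are present (a pairwise AND pattern).
    """
    out = 0.0
    if 0 in T:
        out += 1.0
    if 1 in T and 2 in T:
        out += 2.0
    return out


interactions = harsanyi_interactions(toy_model, 3)
# Keep only interactions with non-negligible strength.
salient = {S: w for S, w in interactions.items() if abs(w) > 1e-9}
```

Under this assumed definition, the interactions sum to the output on the full input, so tracking which subsets carry non-negligible weight at successive SFT checkpoints is one way to observe interactions being removed or newly introduced. How the paper separates noise-like from reliable interactions is not specified in the abstract and is not modeled here.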