LG | Aug 7, 2025

On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification

arXiv:2508.05629v2 | 94 citations | h-index: 12 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses a key problem in fine-tuning LLMs for researchers and practitioners by offering a simple, theoretically motivated improvement that bridges SFT and RL, though it appears incremental as it modifies an existing method.

The paper tackles the limited generalization of Supervised Fine-Tuning (SFT) for Large Language Models by proposing Dynamic Fine-Tuning (DFT), which dynamically rescales the objective function to rectify the reward structure implicit in SFT. DFT significantly outperforms standard SFT across multiple benchmarks and base models, with improved generalization.

We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for Large Language Models (LLMs), addressing its limited generalization compared to reinforcement learning (RL). Through mathematical analysis, we reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the generalization capabilities of the model. To rectify this, we propose Dynamic Fine-Tuning (DFT), which stabilizes gradient updates for each token by dynamically rescaling the objective function with the probability of that token. Remarkably, this single-line code change significantly outperforms standard SFT across multiple challenging benchmarks and base models, demonstrating greatly improved generalization. Additionally, our approach shows competitive results in offline RL settings, offering an effective yet simpler alternative. This work bridges theoretical insight and practical solutions, substantially advancing SFT performance. The code will be available at https://github.com/yongliang-wu/DFT.
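The rescaling described in the abstract can be illustrated for a single token. The sketch below is not the authors' implementation: it is a minimal pure-Python example that assumes the token probability is used as a constant weight (no gradient flows through it), a detail the abstract does not state. The function names are hypothetical, and the gradients are written out analytically with respect to the logits.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sft_token_loss_and_grad(logits, target):
    """Standard SFT for one token: loss = -log p[target].

    The gradient with respect to the logits is p - onehot(target).
    """
    p = softmax(logits)
    loss = -math.log(p[target])
    grad = [p_i - (1.0 if i == target else 0.0) for i, p_i in enumerate(p)]
    return loss, grad


def dft_token_loss_and_grad(logits, target):
    """DFT-style rescaling (assumed form): loss = -w * log p[target].

    ASSUMPTION: w = p[target] is treated as a constant weight, so the
    gradient is simply the SFT gradient scaled by w.
    """
    p = softmax(logits)
    w = p[target]  # constant weight; no gradient is taken through it
    loss = -w * math.log(p[target])
    grad = [w * (p_i - (1.0 if i == target else 0.0)) for i, p_i in enumerate(p)]
    return loss, grad


if __name__ == "__main__":
    # A target token that the model currently considers unlikely.
    logits = [2.0, 0.0, 0.0]
    target = 1
    print("SFT:", sft_token_loss_and_grad(logits, target))
    print("DFT:", dft_token_loss_and_grad(logits, target))
```

Under this assumption, a target token with low probability produces a much smaller update than it would under standard SFT, which is one way to read the abstract's claim that the rescaling stabilizes per-token gradient updates.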
