LG, Feb 3

TMS: Trajectory-Mixed Supervision for Reward-Free, On-Policy SFT

Rana Muhammad Shahroz Khan, Zijie Liu, Zhen Tan, Charles Fleming, Tianlong Chen

arXiv:2602.03073v1, 3.82 citations, h-index: 3

Originality: Incremental advance

AI Analysis

The paper addresses catastrophic forgetting in SFT and offers LLM practitioners a more efficient alternative to RL. The advance is incremental because it builds on existing SFT and RL paradigms.

The paper tackles the trade-off between reinforcement learning (RL) and supervised fine-tuning (SFT) for large language models by introducing Trajectory-Mixed Supervision (TMS), a reward-free framework that approximates on-policy benefits to reduce catastrophic forgetting, resulting in improved accuracy-retention performance on benchmarks like MATH and GSM8K.

Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) are the two dominant paradigms for enhancing Large Language Model (LLM) performance on downstream tasks. While RL generally preserves broader model capabilities (retention) better than SFT, it comes with significant costs: complex reward engineering, instability, and expensive on-policy sampling. In contrast, SFT is efficient but brittle, often suffering from catastrophic forgetting due to Supervision Mismatch: the divergence between the model's evolving policy and static training labels. We address this trade-off with Trajectory-Mixed Supervision (TMS), a reward-free framework that approximates the on-policy benefits of RL by creating a dynamic curriculum from the model's own historical checkpoints. TMS minimizes Policy-Label Divergence (PLD), preventing the mode collapse that drives forgetting in standard SFT. Experiments across reasoning (MATH, GSM8K) and instruction-following benchmarks demonstrate that TMS effectively shifts the accuracy-retention Pareto frontier. While RL remains the gold standard for retention, TMS significantly outperforms standard and iterative SFT, bridging the gap to RL without requiring reward models or verifiers. Mechanistic analysis confirms that PLD drift accurately predicts forgetting and that TMS successfully mitigates this drift.
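The abstract does not give the exact definition of PLD or the rule used to build the curriculum, so the following is only a rough sketch under stated assumptions, not the authors' method. It assumes PLD can be approximated by the mean negative log-probability the current policy assigns to the label tokens, and that "trajectory mixing" means occasionally replacing a static label with a response generated by an earlier checkpoint. The names `policy_label_divergence`, `mix_targets`, and `mix_prob` are hypothetical.

```python
import math
import random


def policy_label_divergence(label_token_probs):
    """Assumed PLD proxy: mean negative log-probability that the current
    policy assigns to the tokens of a training label. Higher values mean the
    label is further from what the policy would produce on its own."""
    if not label_token_probs:
        raise ValueError("need at least one token probability")
    return -sum(math.log(p) for p in label_token_probs) / len(label_token_probs)


def mix_targets(static_labels, checkpoint_outputs, mix_prob, rng):
    """Hypothetical mixing rule: for each prompt, keep the static label, or
    with probability mix_prob swap in a response generated by a randomly
    chosen earlier checkpoint that has an output for that prompt."""
    mixed = {}
    for prompt, label in static_labels.items():
        candidates = [outs[prompt] for outs in checkpoint_outputs if prompt in outs]
        if candidates and rng.random() < mix_prob:
            mixed[prompt] = rng.choice(candidates)
        else:
            mixed[prompt] = label
    return mixed


# Toy data: static labels plus outputs saved from two earlier checkpoints.
static = {"2+2?": "4", "3*3?": "9"}
checkpoints = [
    {"2+2?": "2+2 = 4"},
    {"2+2?": "The answer is 4", "3*3?": "3*3 = 9"},
]
mixed = mix_targets(static, checkpoints, mix_prob=0.5, rng=random.Random(0))
```

Under these assumptions, targets drawn from earlier checkpoints sit closer to the model's own distribution than the static labels do, which is one plausible reading of how a checkpoint-based curriculum could keep the divergence low without a reward model.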
