LG · AI · May 21

Post-Training is About States, Not Tokens: A State Distribution View of SFT, RL, and On-Policy Distillation

arXiv:2605.22731 · 57.6

Predicted impact: top 40% in LG · last 90 days
Originality: Incremental advance

AI Analysis

For practitioners of LLM post-training, this work highlights that the choice of training states (on-policy vs. fixed dataset) can significantly impact model performance and retention, offering a complementary perspective to loss-centric analyses.

The paper argues that the state distribution (prompts plus generated prefixes) used during post-training can be as important as the loss function. In experiments with Qwen3-0.6B-Base, on-policy distillation and RL improve GSM8K performance while preserving retention on TruthfulQA and MMLU, whereas a stress SFT run causes substantial forgetting.

Large language model post-training methods such as supervised fine-tuning (SFT), reinforcement learning (RL), and distillation are often analyzed through their loss functions: maximum likelihood, policy gradients, forward KL, reverse KL, or related objective-level variants. We study a complementary factor: the state distribution on which supervision is applied. For an autoregressive policy, a state is a prompt plus generated prefix. SFT trains on fixed dataset states, while RL and on-policy distillation (OPD) train on states induced by the current learner. We formalize post-training as state-distribution shaping and run a controlled small-scale study using Qwen3-0.6B-Base on GSM8K, with TruthfulQA and MMLU as retention evaluations. Our results show three phenomena. First, a mild SFT run improves GSM8K with little forgetting, while a stress SFT run causes substantial retention loss. Second, OPD from a degraded SFT teacher surpasses that teacher on GSM8K, TruthfulQA, and MMLU, despite using the teacher as its only supervision source. Third, a lightweight on-policy RL run improves GSM8K while preserving retention. These results support a state-centric view of post-training: the source and locality of training states can be as important as the form of the supervision signal.
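The distinction between fixed dataset states and learner-induced states can be illustrated with a toy sketch. This is not the paper's code: the function names, the tuple-of-tokens state representation, and the uniform-random `toy_policy` are all hypothetical stand-ins chosen for illustration. It only shows where the training states come from in each case, assuming a state is the prompt tokens followed by the prefix generated so far.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Assumption: a state is the prompt tokens plus the prefix generated so far.
State = Tuple[str, ...]
Policy = Callable[[State, random.Random], str]


def dataset_states(prompt: Sequence[str], reference: Sequence[str]) -> List[State]:
    """SFT-style states: the prompt plus each prefix of a fixed reference completion."""
    return [tuple(prompt) + tuple(reference[:t]) for t in range(len(reference))]


def on_policy_states(
    prompt: Sequence[str],
    policy: Policy,
    max_new_tokens: int,
    rng: random.Random,
) -> List[State]:
    """RL/OPD-style states: the prompt plus each prefix of the learner's own sample."""
    state: State = tuple(prompt)
    states: List[State] = []
    for _ in range(max_new_tokens):
        states.append(state)          # supervision would be applied at this state
        token = policy(state, rng)    # the next state depends on the learner itself
        if token == "<eos>":
            break
        state = state + (token,)
    return states


def toy_policy(state: State, rng: random.Random) -> str:
    """Hypothetical learner: picks the next token uniformly at random."""
    return rng.choice(["3", "4", "+", "=", "<eos>"])


if __name__ == "__main__":
    prompt = ["2", "+", "2", "="]
    print("dataset states:  ", dataset_states(prompt, ["4", "<eos>"]))
    print("on-policy states:", on_policy_states(prompt, toy_policy, 5, random.Random(0)))
```

In this sketch the dataset states never change as the learner changes, while the on-policy states are regenerated from whatever the current learner samples. That dependence on the learner is the property the abstract attributes to RL and OPD.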
