LG · AI · May 21

Post-Training is About States, Not Tokens: A State Distribution View of SFT, RL, and On-Policy Distillation

arXiv:2605.22731
Predicted impact: top 40% in LG (last 90 days) · Originality: incremental advance
AI Analysis

For practitioners of LLM post-training, this work highlights that the choice of training states (on-policy vs. fixed dataset) can significantly impact model performance and retention, offering a complementary perspective to loss-centric analyses.

The paper argues that the state distribution (prompts plus generated prefixes) used during post-training can be as important as the loss function. In experiments with Qwen3-0.6B-Base, on-policy distillation and RL improve GSM8K performance while preserving retention on TruthfulQA and MMLU, whereas a stress SFT run causes substantial forgetting.

Large language model post-training methods such as supervised fine-tuning (SFT), reinforcement learning (RL), and distillation are often analyzed through their loss functions: maximum likelihood, policy gradients, forward KL, reverse KL, or related objective-level variants. We study a complementary factor: the state distribution on which supervision is applied. For an autoregressive policy, a state is a prompt plus a generated prefix. SFT trains on fixed dataset states, while RL and on-policy distillation (OPD) train on states induced by the current learner. We formalize post-training as state-distribution shaping and run a controlled small-scale study using Qwen3-0.6B-Base on GSM8K, with TruthfulQA and MMLU as retention evaluations. Our results show three phenomena. First, a mild SFT run improves GSM8K with little forgetting, while a stress SFT run causes substantial retention loss. Second, OPD from a degraded SFT teacher surpasses that teacher on GSM8K, TruthfulQA, and MMLU, despite using the teacher as its only supervision source. Third, a lightweight on-policy RL run improves GSM8K while preserving retention. These results support a state-centric view of post-training: the source and locality of training states can be as important as the form of the supervision signal.
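
To make the distinction between fixed dataset states and learner-induced states concrete, the following is a minimal sketch, not the paper's implementation. It treats a state as a (prompt, prefix) pair, as the abstract defines it. The function names `dataset_states` and `on_policy_states`, the `"<eos>"` marker, and the toy policy are all hypothetical and chosen only for illustration.

```python
import random
from typing import Callable, List, Sequence, Tuple

# A state is a prompt plus the prefix generated so far (per the abstract).
State = Tuple[str, Tuple[str, ...]]
# Hypothetical policy signature: (prompt, prefix, rng) -> next token.
Policy = Callable[[str, Sequence[str], random.Random], str]


def dataset_states(prompt: str, reference: Sequence[str]) -> List[State]:
    """SFT-style states: every prefix of a fixed reference completion."""
    return [(prompt, tuple(reference[:t])) for t in range(len(reference))]


def on_policy_states(
    prompt: str, policy: Policy, max_len: int, rng: random.Random
) -> List[State]:
    """RL/OPD-style states: prefixes produced by the current learner itself."""
    states: List[State] = []
    prefix: List[str] = []
    for _ in range(max_len):
        states.append((prompt, tuple(prefix)))
        token = policy(prompt, prefix, rng)
        if token == "<eos>":  # assumed end-of-sequence marker
            break
        prefix.append(token)
    return states


def toy_policy(prompt: str, prefix: Sequence[str], rng: random.Random) -> str:
    """Illustrative stand-in for a learner: samples uniformly from a tiny vocabulary."""
    return rng.choice(["2", "+", "3", "=", "5", "<eos>"])


if __name__ == "__main__":
    rng = random.Random(0)
    print(dataset_states("2+3?", ["2", "+", "3", "=", "5"]))
    print(on_policy_states("2+3?", toy_policy, max_len=8, rng=rng))
```

In this sketch the supervision signal is deliberately left out: both functions only decide where a loss would be applied. The dataset states are the same on every pass, while the on-policy states change whenever the learner changes, which is the factor the abstract isolates.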
