CL · AI · May 29

Your Teacher Can't Help You Here: Combating Supervision Fidelity Decay in On-Policy Distillation

arXiv:2605.30833 · 0.47 · h-index: 48
AI Analysis · 70

This paper tackles supervision fidelity decay in on-policy distillation, a problem that limits student models on tasks requiring long reasoning chains.

This paper addresses Supervision Fidelity Decay (SFD) in on-policy distillation, where teacher feedback weakens over long reasoning chains. It introduces Lookahead Group Reward (LGR), which evaluates student tokens by the teacher confidence at the subsequent step, improving mean@8 by 2.57 points over OPD for a 7B student across six benchmarks, with gains of up to 4.92 points on AIME-26.

On-policy distillation transfers reasoning capabilities by training a student model on its own generated trajectories using token-level feedback from a teacher. However, we identify a critical bottleneck, Supervision Fidelity Decay (SFD): as student-generated prefixes lengthen, the teacher's next-token distribution becomes less confident and less discriminative. Consequently, the teacher-dependent corrective signal in reverse-KL distillation weakens, causing student drift to compound across long reasoning chains. To mitigate SFD, we introduce Lookahead Group Reward (LGR). Building on the insight that next-step teacher confidence reflects the discriminative strength of future reverse-KL supervision, LGR evaluates the student's top-K candidate tokens by the teacher confidence they induce at the subsequent step and assigns a group-normalized reward. To maintain computational efficiency, we further design an entropy-triggered tree-attention mechanism. Across six math and code benchmarks, LGR improves mean@8 by 2.57 points over OPD for a 7B student, with gains increasing at longer generation lengths and reaching +4.92 points on AIME-26 at 39k tokens.
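
The lookahead reward described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how teacher confidence is measured or how the group normalization is computed, so the sketch assumes confidence is the teacher's maximum next-token probability and that rewards are z-scored across the K candidates. The names `lookahead_group_reward`, `teacher_confidence`, and `toy_teacher` are hypothetical, and the entropy-triggered tree-attention mechanism is not modeled.

```python
import math
from typing import Callable, Dict, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def teacher_confidence(probs: Sequence[float]) -> float:
    """ASSUMPTION: confidence = the teacher's maximum next-token probability."""
    return max(probs)


def lookahead_group_reward(
    prefix: Sequence[int],
    student_logits: Sequence[float],
    teacher_next_probs: Callable[[List[int]], List[float]],
    k: int = 4,
    eps: float = 1e-6,
) -> Dict[int, float]:
    """Score the student's top-k candidate tokens by the teacher confidence
    each one induces at the following step, then normalize within the group.

    ASSUMPTION: group normalization is a z-score over the k candidates.
    """
    # Indices of the k highest-scoring tokens under the student.
    top_k = sorted(
        range(len(student_logits)),
        key=lambda i: student_logits[i],
        reverse=True,
    )[:k]

    # Teacher confidence one step ahead, after appending each candidate.
    conf = [
        teacher_confidence(teacher_next_probs(list(prefix) + [tok]))
        for tok in top_k
    ]

    mean = sum(conf) / len(conf)
    std = math.sqrt(sum((c - mean) ** 2 for c in conf) / len(conf))
    return {tok: (c - mean) / (std + eps) for tok, c in zip(top_k, conf)}


def toy_teacher(prefix: List[int]) -> List[float]:
    """Hypothetical teacher: sharply peaked after token 0, flatter otherwise."""
    logits = [3.0, 0.0, 0.0] if prefix[-1] == 0 else [0.5, 0.0, 0.0]
    return softmax(logits)


rewards = lookahead_group_reward(
    prefix=[2],
    student_logits=[1.0, 2.0, 0.5],
    teacher_next_probs=toy_teacher,
    k=2,
)
print(rewards)
```

In this toy setup the student prefers token 1, but token 0 leads to a more confident teacher distribution at the next step, so token 0 receives the positive reward and token 1 the negative one. In a real system the per-candidate teacher calls would be batched, and the resulting reward would be combined with the reverse-KL objective in whatever way the paper specifies.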

