AI · May 13

Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning

Junlong Ke, Zichen Wen, Weijia Li, Conghui He, Linfeng Zhang

arXiv:2605.13255 · 84.41 citations

Predicted impact top 29% in AI · last 90 days

Originality Incremental advance

AI Analysis

For researchers training LLMs for reasoning, this work offers a principled way to handle token-level uncertainty in self-distillation, leading to better efficiency-accuracy trade-offs.

The authors propose EGRSD and CL-EGRSD, which use teacher-entropy confidence gates to down-weight high-entropy token positions during on-policy self-distillation for LLM reasoning. On Qwen3-4B and Qwen3-8B, these methods improve the accuracy-length frontier compared to existing trainable approaches.

On-policy self-distillation trains a reasoning model on its own rollouts while a teacher, often the same model conditioned on privileged context, provides dense token-level supervision. Existing objectives typically weight the teacher's token-level signal uniformly across a chain-of-thought sequence, despite substantial variation in the entropy of the teacher's predictive distribution. We propose EGRSD (Entropy-Guided Reinforced Self-Distillation), which unifies token-level updates through three signals: a reward-grounded direction, a teacher-student likelihood-ratio magnitude, and the proposed teacher-entropy confidence gate that down-weights high-entropy token positions while maintaining a nonzero lower bound on every token weight. We further introduce CL-EGRSD, a causal-lookahead variant that distinguishes sustained high-entropy spans from transient high-entropy positions whose following context rapidly becomes low entropy. Experiments with Qwen3-4B and Qwen3-8B in thinking mode show that EGRSD and CL-EGRSD advance the accuracy-length frontier among the compared trainable methods.
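The abstract does not give the exact formulas, so the following is only a minimal sketch of how the three signals and the lookahead distinction could fit together. Every name and functional form here is an assumption made for illustration: the linear gate, the `floor` parameter, the absolute log-likelihood gap used as magnitude, and the `window` and `threshold` parameters are hypothetical, not the authors' definitions.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def confidence_gate(teacher_probs, floor=0.1):
    """Map teacher entropy to a weight in [floor, 1].

    Assumed form: linear in normalized entropy. Low entropy gives a gate
    near 1, maximal entropy gives exactly `floor`, so the gate never
    reaches zero.
    """
    h_max = math.log(len(teacher_probs))
    normalized = entropy(teacher_probs) / h_max if h_max > 0 else 0.0
    return floor + (1.0 - floor) * (1.0 - normalized)


def token_update(teacher_probs, teacher_logp, student_logp, reward_sign, floor=0.1):
    """Combine the three signals for one token (hypothetical product form).

    reward_sign: +1 or -1, standing in for the reward-grounded direction.
    The magnitude is taken as the absolute teacher-student log-likelihood gap.
    """
    magnitude = abs(teacher_logp - student_logp)
    return reward_sign * magnitude * confidence_gate(teacher_probs, floor)


def classify_position(entropies, t, window=2, threshold=1.0):
    """Label position t as 'low', 'transient', or 'sustained'.

    A high-entropy position counts as transient when the mean entropy of the
    next `window` positions falls below `threshold`. With no following
    context, this sketch falls back to 'sustained' (an arbitrary choice).
    """
    if entropies[t] < threshold:
        return "low"
    following = entropies[t + 1 : t + 1 + window]
    if not following:
        return "sustained"
    mean_following = sum(following) / len(following)
    return "transient" if mean_following < threshold else "sustained"
```

In this sketch a uniform teacher distribution receives the minimum gate value `floor`, while a one-hot distribution receives a gate of 1. How CL-EGRSD actually weights transient versus sustained positions is not stated in the abstract, so `classify_position` only labels positions and applies no weighting.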
