Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning

Zihao Han, Tiangang Zhang, Huaibin Wang, Yilun Sun

arXiv:2605.11458

AI Analysis

For researchers improving LLM reasoning via self-distillation, this work identifies a previously overlooked design choice (fixed, full teacher exposure) and addresses it with an adaptive exposure mechanism that yields consistent improvements.

ATESD introduces a learnable teacher exposure controller for self-distillation in LLM reasoning, addressing the mismatch when the teacher sees too much privileged reasoning. It achieves consistent gains of +0.95 to +2.33 Average@12 points over OPSD on AIME 24/25 and HMMT 25 across Qwen3-1.7B/4B/8B.

On-policy self-distillation has become a strong recipe for LLM reasoning, where a privileged teacher supervises the student's own rollouts while conditioning on the reference solution. A design choice shared by nearly all such methods, however, has gone unquestioned: the teacher always sees the full reference reasoning. We argue that this default itself is part of the problem and identify a teacher-side exposure mismatch: when the teacher conditions on reasoning far beyond the student's current competence, the resulting token targets become too strong to absorb. A controlled fixed-exposure sweep makes this concrete on two fronts: 1) full exposure is not reliably the best choice, and 2) student-teacher mismatch grows monotonically as the teacher sees more privileged reasoning. This motivates treating teacher exposure not as a fixed hyperparameter but as a learnable training-time control variable. We therefore propose Adaptive Teacher Exposure for Self-Distillation (ATESD). ATESD models the reveal ratio with a lightweight Beta-policy controller conditioned on compact training-state statistics, and uses one sampled exposure for a short hold window of student updates. To make this exposure controller learnable, we optimize it with a discounted learning-progress reward that scores each held decision by its effect on the student's future improvement rather than its immediate loss change, addressing the delayed credit assignment induced by on-policy distillation. Experiments on AIME 24, AIME 25, and HMMT 25 across Qwen3-{1.7B, 4B, 8B} show that ATESD consistently outperforms competitive self-distillation and RL baselines, improving over OPSD by +0.95, +2.05, and +2.33 Average@12 points respectively, and establishing adaptive teacher exposure as an effective new axis for reasoning self-distillation.
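
The controller described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the state features, the parameterization of the Beta policy, the exact form of the reward, or how a reveal ratio maps onto the reference text. The sketch assumes a linear map from state statistics to the Beta parameters, a prefix-style reveal of the reference reasoning, a reward equal to the discounted sum of later loss decreases, and a REINFORCE-style update with a finite-difference gradient. All names (`BetaExposureController`, `reveal_prefix`, `discounted_progress`) are hypothetical.

```python
import math
import random


def discounted_progress(losses, gamma=0.9):
    """Assumed reward: discounted sum of per-step loss decreases that
    follow a held exposure decision (larger means more future progress)."""
    return sum(gamma ** k * (losses[k] - losses[k + 1])
               for k in range(len(losses) - 1))


def reveal_prefix(reference_tokens, reveal_ratio):
    """Assumed exposure rule: the teacher sees only a leading fraction
    of the reference reasoning tokens."""
    n = int(round(reveal_ratio * len(reference_tokens)))
    return reference_tokens[:n]


def _softplus(z):
    return math.log1p(math.exp(z))


def beta_log_prob(x, a, b):
    """Log-density of Beta(a, b) at x in (0, 1)."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - log_norm


class BetaExposureController:
    """Hypothetical linear Beta policy over the reveal ratio."""

    def __init__(self, n_features, lr=0.05, seed=0):
        self.n = n_features
        self.w = [0.0] * (2 * n_features)  # first half -> a, second half -> b
        self.lr = lr
        self.rng = random.Random(seed)

    def _params(self, w, state):
        # Adding 1 keeps a, b > 1, so the density is unimodal (an assumption).
        a = 1.0 + _softplus(sum(wi * s for wi, s in zip(w[:self.n], state)))
        b = 1.0 + _softplus(sum(wi * s for wi, s in zip(w[self.n:], state)))
        return a, b

    def log_prob(self, state, rho, w=None):
        a, b = self._params(self.w if w is None else w, state)
        return beta_log_prob(rho, a, b)

    def sample(self, state):
        a, b = self._params(self.w, state)
        rho = self.rng.betavariate(a, b)
        return min(max(rho, 1e-6), 1.0 - 1e-6)  # keep log-density finite

    def update(self, state, rho, advantage, eps=1e-5):
        """REINFORCE-style step; the gradient of the log-density is taken
        by central finite differences to avoid needing a digamma function."""
        grad = []
        for i in range(len(self.w)):
            up, down = list(self.w), list(self.w)
            up[i] += eps
            down[i] -= eps
            diff = self.log_prob(state, rho, up) - self.log_prob(state, rho, down)
            grad.append(diff / (2 * eps))
        self.w = [wi + self.lr * advantage * g for wi, g in zip(self.w, grad)]
```

In a training loop under these assumptions, one would call `sample(state)` once, keep the resulting reveal ratio fixed for the hold window of student updates, record the student losses over the following steps, and then call `update(state, rho, discounted_progress(losses) - baseline)`, where `baseline` is a running average of past rewards (also an assumption, used here only to reduce variance).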
