OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning
For researchers improving LLM reasoning via self-distillation, this work addresses miscalibrated token-level supervision from self-reflected teachers and offers a practical fix that improves reasoning performance across diverse benchmarks.
The paper identifies a mismatch between teacher and student responses in on-policy self-distillation (OPSD) for LLM reasoning. It proposes OGLS-SD, an outcome-guided logit-steering framework that uses verifiable rewards to calibrate teacher logits, improving reasoning performance over standard OPSD across diverse benchmarks.
We study on-policy self-distillation (OPSD), where a language model improves its reasoning ability by distilling privileged teacher distributions along its own on-policy trajectories. Despite the performance gains of OPSD, we identify a common but often overlooked mismatch between teacher and student responses: self-reflected teacher responses can be shifted by reflection-induced bias and response templates, leading to miscalibrated token-level supervision. To mitigate this issue, we propose OGLS-SD, an outcome-guided logit-steering framework that leverages verifiable outcome rewards to contrast successful and failed on-policy trajectories and calibrate teacher logits. By combining outcome-level correctness with dense token-level guidance through logit steering, OGLS-SD stabilizes self-distillation and improves reasoning performance over standard OPSD and other variants across diverse benchmarks.
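The abstract does not state the steering rule, so the following is only a toy sketch of one possible reading of "contrasting successful and failed trajectories to calibrate teacher logits". It assumes, hypothetically, that the teacher logits at a token position are shifted along the difference between the mean logits observed on verified-correct and verified-incorrect rollouts, and that the student is then trained against the steered distribution with a KL term. The function names, the `alpha` coefficient, and the additive form are all illustrative assumptions, not the paper's formulation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_vector(rows):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steer_teacher_logits(teacher_logits, success_logits, failure_logits, alpha=1.0):
    """Hypothetical steering rule (an assumption, not the paper's):
    shift teacher logits along (mean success logits - mean failure logits)."""
    if not success_logits or not failure_logits:
        # No outcome contrast available: leave the teacher unchanged.
        return list(teacher_logits)
    pos = mean_vector(success_logits)
    neg = mean_vector(failure_logits)
    return [t + alpha * (p - n) for t, p, n in zip(teacher_logits, pos, neg)]


def kl_divergence(p, q):
    """KL(p || q) for two strictly positive distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy 3-token vocabulary at a single position.
teacher = [2.0, 1.0, 0.0]                       # self-reflected teacher prefers token 0
success = [[1.0, 3.0, 0.0], [1.0, 2.0, 0.0]]    # logits seen on verified-correct rollouts
failure = [[3.0, 1.0, 0.0]]                     # logits seen on verified-incorrect rollouts
student = [1.5, 1.5, 0.0]

steered = steer_teacher_logits(teacher, success, failure, alpha=1.0)
loss = kl_divergence(softmax(steered), softmax(student))  # token-level distillation term
```

In this toy case the steered teacher's preferred token moves from index 0 to index 1, which illustrates how outcome-level contrast could, in principle, correct token-level supervision; how the actual method computes and applies the steering signal is not specified in the abstract.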