LG · AI · May 12

OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning

arXiv:2605.12400 · 83.31 citations
Predicted impact: top 11% in LG · last 90 days
Originality: Incremental advance
AI Analysis

For researchers improving LLM reasoning via self-distillation, this work addresses a calibration issue in token-level supervision, offering a practical fix that yields consistent gains.

The paper identifies a mismatch between teacher and student responses in on-policy self-distillation for LLM reasoning, and proposes OGLS-SD, an outcome-guided logit-steering framework that uses verifiable rewards to calibrate teacher logits, improving reasoning performance over standard OPSD across diverse benchmarks.

We study on-policy self-distillation (OPSD), where a language model improves its reasoning ability by distilling privileged teacher distributions along its own on-policy trajectories. Despite the performance gains of OPSD, we identify a common but often overlooked mismatch between teacher and student responses: self-reflected teacher responses can be shifted by reflection-induced bias and response templates, leading to miscalibrated token-level supervision. To mitigate this issue, we propose OGLS-SD, an outcome-guided logit-steering framework that leverages verifiable outcome rewards to contrast successful and failed on-policy trajectories and calibrate teacher logits. By combining outcome-level correctness with dense token-level guidance through logit steering, OGLS-SD stabilizes self-distillation and improves reasoning performance over standard OPSD and other variants across diverse benchmarks.
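
The abstract does not spell out the steering rule, so the following is only a toy sketch of the general idea, not the paper's formulation. It assumes a single token position with a three-token vocabulary, and it assumes the teacher logits are shifted by the difference between mean logits on successful and failed rollouts. The function names, the mean-difference rule, and the strength parameter alpha are all hypothetical.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mean_vector(rows):
    # Element-wise mean of equally sized logit vectors.
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steer_teacher_logits(teacher_logits, success_logits, failure_logits, alpha=1.0):
    # ASSUMPTION (not from the paper): steer by the contrast between
    # mean logits on successful and on failed on-policy rollouts.
    mu_pos = mean_vector(success_logits)
    mu_neg = mean_vector(failure_logits)
    return [t + alpha * (p - n) for t, p, n in zip(teacher_logits, mu_pos, mu_neg)]

def distillation_loss(student_logits, teacher_logits):
    # Token-level KL(teacher || student), a common distillation objective.
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy data: rollouts verified as correct favour token 1, failed ones favour token 0.
teacher = [2.0, 1.0, 0.0]
success = [[1.0, 2.0, 0.0], [1.0, 3.0, 0.0]]
failure = [[2.0, 1.0, 0.0], [2.0, 0.0, 0.0]]

steered = steer_teacher_logits(teacher, success, failure, alpha=1.0)
# steered == [1.0, 3.0, 0.0]: the teacher's preferred token moves from 0 to 1.
student = [0.5, 0.5, 0.0]
loss = distillation_loss(student, steered)
```

Setting alpha to 0 recovers the unsteered teacher, which corresponds to the standard OPSD target in this toy setup.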

