Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
This work addresses the need for more efficient and effective reasoning in large language models, particularly on mathematical tasks, though the contribution is incremental in that it builds on existing distillation techniques.
The paper introduces On-Policy Self-Distillation (OPSD) to improve large language model reasoning: a single model acts as both teacher and student, which reduces distribution mismatch and makes use of ground-truth solutions. On mathematical reasoning benchmarks, it reports 4-8x token efficiency over reinforcement learning methods and superior performance over off-policy distillation.
Knowledge distillation improves large language model (LLM) reasoning by compressing the knowledge of a teacher LLM to train smaller LLMs. On-policy distillation advances this approach by having the student sample its own trajectories while a teacher LLM provides dense token-level supervision, addressing the distribution mismatch between training and inference in off-policy distillation methods. However, on-policy distillation typically requires a separate, often larger, teacher LLM and does not explicitly leverage ground-truth solutions available in reasoning datasets. Inspired by the intuition that a sufficiently capable LLM can rationalize external privileged reasoning traces and teach its weaker self (i.e., the version without access to privileged information), we introduce On-Policy Self-Distillation (OPSD), a framework where a single model acts as both teacher and student by conditioning on different contexts. The teacher policy conditions on privileged information (e.g., verified reasoning traces) while the student policy sees only the question; training minimizes the per-token divergence between these distributions over the student's own rollouts. We demonstrate the efficacy of our method on multiple mathematical reasoning benchmarks, achieving 4-8x token efficiency compared to reinforcement learning methods such as GRPO and superior performance over off-policy distillation methods.
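To make the training signal concrete, here is a minimal sketch of a per-token divergence averaged over one student rollout. It is an illustration under stated assumptions, not the paper's implementation. The abstract does not say which divergence is used, so the sketch assumes KL(student || teacher). The function names (`log_softmax`, `per_token_kl`, `opsd_loss`) and the toy logits are hypothetical. In an actual setup, both logit sequences would come from the same model scoring the same student-sampled tokens: once with only the question in context (student), and once with the privileged reasoning trace also in context (teacher).

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def per_token_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position.

    Assumption: the choice of KL direction is illustrative only; the source
    only says "per-token divergence".
    """
    log_p = log_softmax(student_logits)  # student: question-only context
    log_q = log_softmax(teacher_logits)  # teacher: question + privileged trace
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def opsd_loss(student_logits_seq, teacher_logits_seq):
    """Mean per-token divergence over one student-sampled rollout.

    Each argument is a list with one logit vector per generated token.
    """
    if len(student_logits_seq) != len(teacher_logits_seq):
        raise ValueError("student and teacher must score the same rollout")
    if not student_logits_seq:
        raise ValueError("rollout must contain at least one token")
    kls = [
        per_token_kl(s, t)
        for s, t in zip(student_logits_seq, teacher_logits_seq)
    ]
    return sum(kls) / len(kls)


if __name__ == "__main__":
    # Toy rollout of two tokens over a 3-word vocabulary (made-up numbers).
    student = [[0.0, 0.0, 0.0], [1.0, 0.5, -1.0]]
    teacher = [[2.0, 0.0, -1.0], [1.0, 0.5, -1.0]]
    print(f"mean per-token KL: {opsd_loss(student, teacher):.4f}")
```

The loss is zero when the two contexts induce identical next-token distributions and grows as the privileged context shifts the teacher's predictions away from the student's. Because every token position contributes its own term, the supervision is dense, in contrast to a single scalar reward per rollout.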