Teacher-Guided Policy Optimization for LLM Distillation
Improves on-policy LLM distillation for practitioners, but is an incremental algorithmic fix for a known bottleneck.
TGPO addresses the failure of Reverse KL in LLM distillation when the student and teacher distributions diverge, achieving significant improvements over standard baselines on complex reasoning benchmarks.
The convergence of reinforcement learning and imitation learning has positioned Reverse KL (RKL) as a promising paradigm for on-policy LLM distillation, aiming to unify exploration with teacher supervision. However, we identify a critical limitation: when the student and teacher distributions diverge significantly, standard RKL often fails to yield meaningful improvement because its negative feedback is uninformative. To address this limitation, we propose Teacher-Guided Policy Optimization (TGPO), an on-policy algorithm that incorporates dense directional guidance by leveraging teacher predictions conditioned on the student's rollout. Because TGPO remains on-policy, it integrates seamlessly with existing reinforcement learning with verifiable rewards (RLVR) frameworks without requiring additional data annotation. Experiments on complex reasoning benchmarks demonstrate that TGPO significantly outperforms standard baselines and is robust to the choice of teacher.
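The failure mode described above can be illustrated with a toy sketch. This is not the TGPO objective, which the abstract does not specify; it shows only the standard per-token reverse KL between a student and a teacher, and the single-sample signal a student-drawn token receives. The function names, the 4-token vocabulary, and the logit values are all hypothetical, and reading the teacher's top token as a form of "directional guidance" is an assumption made for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one token position, over the full vocabulary."""
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(log_s, log_t))


def sampled_token_signal(student_logits, teacher_logits, token):
    """Single-sample RKL signal for a student-drawn token: log p_T - log p_S."""
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    return log_t[token] - log_s[token]


# Hypothetical logits at one position of a student rollout.
student = [4.0, 0.0, 0.0, 0.0]  # student is confident in token 0
teacher = [0.0, 0.0, 0.0, 4.0]  # teacher, given the same prefix, prefers token 3

aligned_kl = reverse_kl(student, student)    # identical distributions
diverged_kl = reverse_kl(student, teacher)   # strongly diverged distributions

# The student samples token 0; the scalar signal is negative, but it does not
# say which token would have been better.
signal = sampled_token_signal(student, teacher, token=0)

# Assumption for illustration: the teacher's distribution on the student's
# prefix does carry that direction, e.g. through its most likely token.
teacher_top = max(range(len(teacher)), key=teacher.__getitem__)

print(aligned_kl, diverged_kl, signal, teacher_top)
```

In this toy case the sampled token is penalized, yet the penalty alone gives the student no indication that token 3 is the teacher's preference. That information is present only in the teacher's prediction conditioned on the student's prefix, which is the quantity the abstract says TGPO leverages.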