LG · AI · May 13

Teacher-Guided Policy Optimization for LLM Distillation

arXiv:2605.13230 · 98.6
Predicted impact: top 1% in LG · last 90 days
Originality: Incremental advance
AI Analysis

Improves on-policy LLM distillation for practitioners, but is an incremental algorithmic fix for a known bottleneck.

TGPO addresses the failure of Reverse KL in LLM distillation when student-teacher distributions diverge, achieving significant improvements on complex reasoning benchmarks over standard baselines.

The convergence of reinforcement learning and imitation learning has positioned Reverse KL (RKL) as a promising paradigm for on-policy LLM distillation, aiming to unify exploration with teacher supervision. However, we identify a critical limitation: when the student and teacher distributions diverge significantly, standard RKL often fails to yield meaningful improvement due to uninformative negative feedback. To address this inefficiency, we propose Teacher-Guided Policy Optimization (TGPO), an on-policy algorithm that incorporates dense directional guidance by leveraging teacher predictions conditioned on the student's rollout. Because TGPO remains on-policy, the algorithm integrates seamlessly with existing RLVR frameworks without requiring additional data annotation. Experiments on complex reasoning benchmarks demonstrate that TGPO significantly outperforms standard baselines and is robust to different teachers.
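For readers unfamiliar with the baseline the abstract criticizes, the following is a minimal sketch of a per-token Reverse KL signal computed on a student rollout. It illustrates standard RKL only, not TGPO, whose exact objective the abstract does not specify. The function names, the three-token vocabulary, and the logit values are all hypothetical and chosen purely for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position."""
    log_ps = log_softmax(student_logits)
    log_pt = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(log_ps, log_pt))


def sampled_log_ratio(student_logits, teacher_logits, token_id):
    """Single-sample RKL estimate for the token the student actually emitted.

    A positive value penalizes the sampled token, but on its own it does not
    say which alternative token the teacher would have preferred.
    """
    log_ps = log_softmax(student_logits)
    log_pt = log_softmax(teacher_logits)
    return log_ps[token_id] - log_pt[token_id]


def sequence_reverse_kl(student_steps, teacher_steps):
    """Mean per-token reverse KL over one student rollout.

    Both lists hold logits at each position, with the teacher scored on the
    same student-generated prefix (the on-policy setting).
    """
    per_token = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token)


# Hypothetical toy logits over a 3-token vocabulary, 2 rollout positions.
student_steps = [[2.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
teacher_agree = [[2.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
teacher_diverged = [[-1.0, 0.0, 2.0], [3.0, -2.0, -2.0]]

loss_agree = sequence_reverse_kl(student_steps, teacher_agree)
loss_diverged = sequence_reverse_kl(student_steps, teacher_diverged)
penalty = sampled_log_ratio(student_steps[0], teacher_diverged[0], token_id=0)
```

In this toy setup the loss is zero when the teacher matches the student and grows as the two distributions diverge. The sampled-token form yields one scalar per token, which is one plausible reading of the "uninformative negative feedback" the abstract describes; how TGPO turns teacher predictions into directional guidance is left to the paper itself.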

