LG · May 25, 2025

Online Knowledge Distillation with Reward Guidance

arXiv:2505.18952v1 · 1 citation · h-index: 1
Originality: Incremental advance
AI Analysis

This work addresses efficient knowledge distillation from large language models, a problem relevant to both researchers and practitioners. It appears incremental, since it builds on existing preference optimization methods.

The paper tackles knowledge distillation for large language models by proposing a reward-guided imitation learning framework. The framework formulates a min-max optimization problem between the policy and a reward model to minimize the performance gap between the student and teacher policies, and the authors support it with theoretical analysis and empirical results.

This work studies knowledge distillation (KD) for large language models (LLMs) through preference optimization. We propose a reward-guided imitation learning framework for sequential KD, formulating a min-max optimization problem between the policy and reward model (RM) to minimize the performance gap between the student and teacher policies. Specifically, the reward optimization is constrained to achieve near-optimality within a confidence set for preference alignment. For preference data construction, we explore both offline and online preference-based KD. Additionally, we reformulate the RM using the $Q$-value function and extend the framework to white-box KD, where the teacher policy's predicted probabilities are accessible. Theoretical analysis and empirical results demonstrate the effectiveness of the proposed framework.
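
To make the two quantities in the abstract concrete, here is a minimal Python sketch of a preference loss for a reward model and an empirical student-teacher performance gap. It is an illustration under stated assumptions, not the paper's implementation: the Bradley-Terry form of the preference likelihood and the function names `bradley_terry_nll` and `performance_gap` are assumptions introduced here, and rewards are plain floats rather than model outputs.

```python
import math


def bradley_terry_nll(reward_preferred: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the preferred response wins.

    Assumes a Bradley-Terry preference model (an assumption of this sketch):
    P(preferred beats rejected) = sigmoid(reward_preferred - reward_rejected).
    """
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(margin)) = log(1 + exp(-margin)), computed in a
    # numerically stable way for both signs of the margin.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def performance_gap(teacher_rewards: list[float], student_rewards: list[float]) -> float:
    """Empirical gap: mean reward on teacher samples minus mean reward on student samples."""
    teacher_mean = sum(teacher_rewards) / len(teacher_rewards)
    student_mean = sum(student_rewards) / len(student_rewards)
    return teacher_mean - student_mean


if __name__ == "__main__":
    # Hypothetical reward scores, for illustration only.
    print(bradley_terry_nll(2.0, 0.0))
    print(performance_gap([1.0, 2.0], [0.0, 1.0]))
```

In a min-max setup of the kind the abstract describes, the reward model would be chosen to make the gap large while keeping the preference loss near its minimum, and the student policy would be updated to shrink that gap. How the paper defines the confidence set and the $Q$-value reformulation is not shown here.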

