CL LG · Feb 24, 2025

Knowledge Distillation with Training Wheels

Guanlin Liu, Anand Ramachandran, Tanmay Gangwani, Yan Fu, Abhinav Sethy

arXiv:2502.17717v1 · 8.33 citations · h-index: 28

Originality: Incremental advance

AI Analysis

This incremental work addresses efficiency and adaptability in generative language modeling for applications like translation and summarization.

The paper tackles knowledge distillation by introducing a framework where a student model learns from a teacher during training and can request teacher help at test-time under constraints, showing improved accuracy and flexibility compared to Speculative Decoding in translation and summarization tasks.

In generative language modeling, knowledge distillation trains a smaller student model with the help of a larger teacher model, improving the student's capabilities. In this paper, we formulate a more general framework for knowledge distillation in which the student learns from the teacher during training and also learns to ask for the teacher's help at test time, following rules that specify test-time restrictions. To this end, we first formulate knowledge distillation as an entropy-regularized value optimization problem. Adopting Path Consistency Learning to solve this problem leads to a new knowledge distillation algorithm that uses on-policy and off-policy demonstrations. We extend this, using constrained reinforcement learning, to a framework that incorporates the teacher model as a test-time reference, within constraints. In this situation, akin to a human learner, the model needs to learn not only the learning material but also the relative difficulty of different sections, so that it can prioritize where to seek teacher help. We examine the efficacy of our method through experiments on translation and summarization tasks, observing trends in accuracy and teacher use, and note that our approach unlocks operating points not available to the popular Speculative Decoding approach.
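
The abstract names Path Consistency Learning (PCL) as the solver for the entropy-regularized problem but gives no formulas. As a rough illustration only, the sketch below computes the generic multi-step path-consistency residual from the original PCL formulation, which is zero along every sub-trajectory when the value function and policy are optimal. It is not the paper's distillation algorithm: the function names, the reward signal, and the reading of a "state" as a token prefix and an "action" as the next token are all assumptions made here.

```python
import math
from typing import Sequence


def path_consistency_residual(
    values: Sequence[float],     # V(s_t), ..., V(s_{t+d}); length d + 1
    rewards: Sequence[float],    # r_t, ..., r_{t+d-1}; length d
    log_probs: Sequence[float],  # log pi(a_j | s_j) for each step; length d
    tau: float = 1.0,            # entropy-regularization temperature
    gamma: float = 1.0,          # discount factor
) -> float:
    """Generic PCL residual for one sub-trajectory of length d.

    C = -V(s_t) + gamma^d * V(s_{t+d})
        + sum_j gamma^j * (r_j - tau * log pi(a_j | s_j))

    Hypothetical helper for illustration; not taken from the paper.
    """
    d = len(rewards)
    if len(values) != d + 1 or len(log_probs) != d:
        raise ValueError("need len(values) == len(rewards) + 1 == len(log_probs) + 1")
    # Discounted sum of entropy-regularized rewards along the path.
    soft_return = sum(
        gamma**j * (rewards[j] - tau * log_probs[j]) for j in range(d)
    )
    return -values[0] + gamma**d * values[d] + soft_return


def pcl_loss(residual: float) -> float:
    """Squared-residual objective minimized over value and policy parameters."""
    return 0.5 * residual**2


def soft_optimal_one_step(action_rewards: Sequence[float], tau: float):
    """Closed-form soft-optimal value and log-policy for a one-step problem
    that ends in a terminal state of value 0 (used only as a sanity check)."""
    value = tau * math.log(sum(math.exp(r / tau) for r in action_rewards))
    log_policy = [(r - value) / tau for r in action_rewards]
    return value, log_policy
```

Because the residual can be evaluated on any sub-trajectory regardless of which policy produced it, the same loss applies to on-policy samples and to off-policy demonstrations, which is consistent with the abstract's mention of both. For a one-step problem, plugging the output of `soft_optimal_one_step` into `path_consistency_residual` should give a residual of zero up to floating-point error for every action.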
