AdaSwitch: Adaptive Switching Generation for Knowledge Distillation
This work addresses the challenge of achieving high performance in small language models for latency-constrained applications, though the contribution is incremental in that it builds on existing distillation methods.
The paper tackles the trade-offs in knowledge distillation for small language models by proposing AdaSwitch, which dynamically combines on-policy and off-policy generation at the token level, resulting in consistent accuracy improvements across three datasets with two teacher-student pairs.
Small language models (SLMs) are crucial for applications with strict latency and computational constraints, yet achieving high performance remains challenging. Knowledge distillation (KD) can transfer capabilities from large teacher models, but existing methods involve trade-offs: off-policy distillation provides high-quality supervision but introduces a training-inference mismatch, while on-policy approaches maintain consistency but rely on low-quality student outputs. To address these issues, we propose AdaSwitch, a novel approach that dynamically combines on-policy and off-policy generation at the token level. AdaSwitch allows the student to first explore its own predictions and then selectively integrate teacher guidance based on real-time quality assessment. This approach simultaneously preserves consistency and maintains supervision quality. Experiments on three datasets with two teacher-student LLM pairs demonstrate that AdaSwitch consistently improves accuracy, offering a practical and effective method for distilling SLMs with acceptable additional overhead.
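The token-level switching described above can be illustrated with a minimal sketch. The abstract does not specify the quality measure, the switching rule, or the decoding strategy, so everything below is an assumption made for illustration: the function name adaptive_switch_generate, the use of KL divergence between teacher and student next-token distributions as the quality signal, the fixed threshold, the one-way switch from student to teacher, greedy decoding, and the toy_student and toy_teacher stand-ins for real models.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Dist = Sequence[float]                      # next-token probability distribution
NextTokenFn = Callable[[List[int]], Dist]   # maps a token prefix to a distribution


def kl_divergence(p: Dist, q: Dist, eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def argmax(dist: Dist) -> int:
    """Index of the most probable token (greedy decoding)."""
    return max(range(len(dist)), key=lambda i: dist[i])


def adaptive_switch_generate(
    student: NextTokenFn,
    teacher: NextTokenFn,
    prompt: List[int],
    max_new_tokens: int,
    threshold: float,
    eos_id: Optional[int] = None,
) -> Tuple[List[int], List[str]]:
    """Generate with the student first; hand over to the teacher once the
    per-token divergence exceeds `threshold` (hypothetical switching rule)."""
    tokens = list(prompt)
    sources: List[str] = []
    switched = False
    for _ in range(max_new_tokens):
        t_dist = teacher(tokens)
        if not switched:
            s_dist = student(tokens)
            # Assumed quality check: large divergence means the student has drifted.
            if kl_divergence(t_dist, s_dist) > threshold:
                switched = True
        if switched:
            next_token = argmax(t_dist)
            sources.append("teacher")
        else:
            next_token = argmax(s_dist)
            sources.append("student")
        tokens.append(next_token)
        if eos_id is not None and next_token == eos_id:
            break
    return tokens[len(prompt):], sources


# Toy stand-ins over a 3-token vocabulary (not real models).
def toy_teacher(prefix: List[int]) -> Dist:
    return [0.1, 0.1, 0.8]


def toy_student(prefix: List[int]) -> Dist:
    # Agrees with the teacher on short prefixes, then drifts.
    return [0.2, 0.1, 0.7] if len(prefix) < 3 else [0.8, 0.1, 0.1]


if __name__ == "__main__":
    generated, sources = adaptive_switch_generate(
        toy_student, toy_teacher, prompt=[0], max_new_tokens=4, threshold=0.5
    )
    print(generated, sources)
```

In this toy run the first two tokens come from the student and the remaining ones from the teacher, so the resulting sequence mixes on-policy and off-policy tokens. In a distillation setup, such mixed sequences would then serve as the training inputs on which the student is matched to the teacher; that training step is not shown here.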