Following the Teacher's Footsteps: Scheduled Checkpoint Distillation for Domain-Specific LLMs
This work addresses the challenge of deploying large language models for domain-specific tasks by improving distillation efficiency. The contribution is incremental, since it builds on existing distillation methods.
The paper tackles the suboptimal performance of domain-specific LLM distillation caused by the capacity gap between teacher and student. It proposes Scheduled Checkpoint Distillation with Adaptive Weighting so that student models can match or exceed teacher performance on tasks such as QA, NER, and text classification across multiple languages.
Large language models (LLMs) are challenging to deploy for domain-specific tasks due to their massive scale. While distilling a fine-tuned LLM into a smaller student model is a promising alternative, the capacity gap between teacher and student often leads to suboptimal performance. This raises a key question: when and how can a student model match or even surpass its teacher on domain-specific tasks? In this work, we present a novel theoretical insight: a student can outperform its teacher if its advantage on the Student-Favored Subdomain (SFS) outweighs its deficit on the Teacher-Favored Subdomain (TFS). Guided by this insight, we propose Scheduled Checkpoint Distillation (SCD), which reduces the TFS deficit by emulating the teacher's convergence process during supervised fine-tuning (SFT) on the domain task, together with a sample-wise Adaptive Weighting (AW) mechanism that preserves the student's strengths on the SFS. Experiments across diverse domain tasks, including QA, NER, and text classification in multiple languages, show that our method consistently outperforms existing distillation approaches, allowing the student model to match or even exceed the performance of its fine-tuned teacher.
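The SFS/TFS condition stated in the abstract can be illustrated with a small arithmetic sketch. It is not the paper's formulation: it assumes the domain splits cleanly into the two subdomains, that performance is measured as accuracy, and that overall accuracy is the share-weighted average of the subdomain accuracies. The names `SubdomainScores`, `overall_accuracy`, and `student_surpasses_teacher` are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class SubdomainScores:
    """Scores on one subdomain (illustrative structure, not from the paper)."""
    share: float        # fraction of the domain data in this subdomain
    teacher_acc: float  # teacher accuracy on this subdomain
    student_acc: float  # student accuracy on this subdomain


def overall_accuracy(sfs: SubdomainScores, tfs: SubdomainScores) -> tuple[float, float]:
    """Return (teacher, student) accuracy over the whole domain.

    Assumes SFS and TFS partition the domain, so shares sum to 1.
    """
    teacher = sfs.share * sfs.teacher_acc + tfs.share * tfs.teacher_acc
    student = sfs.share * sfs.student_acc + tfs.share * tfs.student_acc
    return teacher, student


def student_surpasses_teacher(sfs: SubdomainScores, tfs: SubdomainScores) -> bool:
    """Check whether the student's SFS advantage outweighs its TFS deficit."""
    advantage = sfs.share * (sfs.student_acc - sfs.teacher_acc)
    deficit = tfs.share * (tfs.teacher_acc - tfs.student_acc)
    return advantage > deficit


# Made-up numbers for illustration only.
sfs = SubdomainScores(share=0.4, teacher_acc=0.70, student_acc=0.80)
tfs = SubdomainScores(share=0.6, teacher_acc=0.90, student_acc=0.85)
teacher_total, student_total = overall_accuracy(sfs, tfs)
print(teacher_total, student_total, student_surpasses_teacher(sfs, tfs))
```

With these made-up numbers the weighted SFS advantage (0.4 x 0.10) exceeds the weighted TFS deficit (0.6 x 0.05), so the student ends up ahead overall. Under this reading, SCD targets the deficit term and AW protects the advantage term.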