Continual Distillation of Teachers from Different Domains
For practitioners deploying large models, this work addresses the problem of distilling knowledge from multiple domain-specific teachers without storing original training data, offering a method to maintain performance across domains.
This paper introduces Continual Distillation (CD), a new paradigm where a student learns sequentially from a stream of teacher models without retaining access to earlier teachers, and proposes Self External Data Distillation (SE2D). Experiments show that SE2D reduces Unseen Knowledge Forgetting and improves cross-domain generalization.
Deep learning models continue to scale, with some requiring more storage than many large-scale datasets, so retaining every teacher model can be impractical. We therefore introduce a new paradigm: Continual Distillation (CD), where a student learns sequentially from a stream of teacher models without retaining access to earlier teachers. CD faces two challenges: teacher training data is unavailable, and teachers have varying expertise. We show that external unlabeled data enables Unseen Knowledge Transfer (UKT), allowing the student to acquire knowledge about domains that are not present in the training data but are known to the teacher. We also show that sequential distillation causes Unseen Knowledge Forgetting (UKF), in which previously transferred knowledge is lost after training on later teachers. To better balance UKT against UKF, we propose Self External Data Distillation (SE2D), a method that preserves logits on external data to stabilize learning across heterogeneous teachers. Experiments on multiple benchmarks show that SE2D reduces UKF and improves cross-domain generalization. The code and implementation for this work are publicly available at: https://github.com/Nicolas1203/continual_distillation.
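The idea of preserving logits on external data can be illustrated with a minimal per-sample sketch. The abstract does not give the exact objective, so everything here is an assumption made for illustration: the loss combines a temperature-scaled KL term toward the current teacher with a mean-squared penalty toward the logits the student itself produced on the same external sample before training on that teacher. The names `se2d_style_loss`, `stored_student_logits`, `temperature`, and `preserve_weight` are hypothetical and are not taken from the released code.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits (numerically stable)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def se2d_style_loss(student_logits, teacher_logits, stored_student_logits,
                    temperature=2.0, preserve_weight=1.0):
    """Illustrative loss for one external, unlabeled sample.

    Assumed form (not confirmed by the source):
      distill  = KL(teacher || student) at the given temperature
      preserve = MSE between current student logits and the logits the
                 student produced on this sample before the current teacher
    """
    distill = kl_divergence(
        softmax(teacher_logits, temperature),
        softmax(student_logits, temperature),
    )
    preserve = sum(
        (s - o) ** 2 for s, o in zip(student_logits, stored_student_logits)
    ) / len(student_logits)
    return distill + preserve_weight * preserve


# Usage: the student has drifted from its stored logits while following
# a new teacher, so both terms contribute to the loss.
loss = se2d_style_loss(
    student_logits=[0.5, 2.0, 1.0],
    teacher_logits=[0.0, 3.0, 0.5],
    stored_student_logits=[1.5, 1.0, 1.0],
)
print(f"loss = {loss:.4f}")
```

Under this assumed form, `preserve_weight` controls the balance the abstract describes: a larger value keeps the student closer to what it previously output on external data (less UKF), while a smaller value lets it follow the current teacher more freely (more UKT).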