Towards Zero-Shot Knowledge Distillation for Natural Language Processing
This work addresses knowledge transfer for model compression in NLP when task-specific training data is unavailable because of privacy or proprietary constraints, a step towards more practical and deployable NLP solutions.
This paper introduces what the authors describe as the first Zero-Shot Knowledge Distillation (KD) method for NLP, in which a student model learns from a much larger teacher model without access to task-specific training data. The method achieves 75% to 92% of the teacher's classification score (accuracy or F1) across six GLUE benchmark tasks while compressing the model by a factor of 30.
Knowledge Distillation (KD) is a common knowledge transfer algorithm used for model compression across a variety of deep-learning-based natural language processing (NLP) solutions. In its usual form, KD requires access to the teacher's training data to transfer knowledge to the student network. However, privacy concerns, data regulations, and proprietary reasons may prevent access to such data. We present, to the best of our knowledge, the first work on Zero-Shot Knowledge Distillation for NLP, where the student learns from the much larger teacher without any task-specific data. Our solution combines out-of-domain data and adversarial training to learn the teacher's output distribution. We investigate six tasks from the GLUE benchmark and demonstrate that we can achieve between 75% and 92% of the teacher's classification score (accuracy or F1) while compressing the model by a factor of 30.
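
To make "learning the teacher's output distribution" concrete, the following is a minimal sketch of a distillation objective, not the paper's implementation. It assumes the standard temperature-softened KL divergence between teacher and student outputs, which this abstract does not specify; the function names, the temperature value, and the example logits are hypothetical. In the zero-shot setting the logits would come from out-of-domain inputs rather than task data, and the adversarial training component is not shown.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Assumption: a generic KD objective, chosen for illustration only.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical logits for one out-of-domain sentence (3-class task).
teacher = [2.0, 0.5, -1.0]
student = [1.0, 1.0, 0.0]
loss = distillation_loss(teacher, student)
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge, so minimizing it over unlabeled inputs requires no task labels, only teacher outputs.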