CVMay 18, 2023

Student-friendly Knowledge Distillation

arXiv:2305.10893v136 citations
Originality Incremental advance
AI Analysis

This is an incremental improvement for knowledge distillation in machine learning, making it easier for student models to learn from teachers.

The paper tackles the problem of complex teacher outputs hindering student learning in knowledge distillation by proposing student-friendly knowledge distillation (SKD), which simplifies teacher knowledge using softening and attention mechanisms, achieving state-of-the-art performance on CIFAR-100 and ImageNet datasets.

In knowledge distillation, the knowledge from the teacher model is often too complex for the student model to thoroughly process. However, good teachers in real life always simplify complex material before teaching it to students. Inspired by this fact, we propose student-friendly knowledge distillation (SKD) to simplify teacher output into new knowledge representations, which makes the learning of the student model easier and more effective. SKD contains a softening processing and a learning simplifier. First, the softening processing uses the temperature hyperparameter to soften the output logits of the teacher model, which simplifies the output to some extent and makes it easier for the learning simplifier to process. The learning simplifier utilizes the attention mechanism to further simplify the knowledge of the teacher model and is jointly trained with the student model using the distillation loss, which means that the process of simplification is correlated with the training objective of the student model and ensures that the simplified new teacher knowledge representation is more suitable for the specific student model. Furthermore, since SKD does not change the form of the distillation loss, it can be easily combined with other distillation methods that are based on the logits or features of intermediate layers to enhance its effectiveness. Therefore, SKD has wide applicability. The experimental results on the CIFAR-100 and ImageNet datasets show that our method achieves state-of-the-art performance while maintaining high training efficiency.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes