CV · Dec 1, 2020

Multi-level Knowledge Distillation via Knowledge Alignment and Correlation

Fei Ding, Yin Yang, Hongxin Hu, Venkat Krovi, Feng Luo

arXiv:2012.00573v23.36 citations · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the problem of comprehensively transferring knowledge from teacher to student models for model compression and knowledge transfer, which is relevant for researchers and practitioners in machine learning.

This paper introduces Multi-level Knowledge Distillation (MLKD), a method that simultaneously considers knowledge alignment (individual sample knowledge) and knowledge correlation (relational knowledge between samples). MLKD is shown to outperform state-of-the-art methods across various pretraining strategies, network architectures, datasets, and tasks, improving the reliability and transferability of learned representations.

Knowledge distillation (KD) has become an important technique for model compression and knowledge transfer. In this work, we first perform a comprehensive analysis of the knowledge transferred by different KD methods. We demonstrate that traditional KD methods, which minimize the KL divergence of softmax outputs between networks, are related to the knowledge alignment of an individual sample only. Meanwhile, recent contrastive learning-based KD methods mainly transfer relational knowledge between different samples, namely, knowledge correlation. Because it is important to transfer the full knowledge from teacher to student, we introduce Multi-level Knowledge Distillation (MLKD), which considers both knowledge alignment and correlation. MLKD is task-agnostic and model-agnostic, and can easily transfer knowledge from supervised or self-supervised pretrained teachers. We show that MLKD can improve the reliability and transferability of learned representations. Experiments demonstrate that MLKD outperforms other state-of-the-art methods across a large number of experimental settings, including different (a) pretraining strategies, (b) network architectures, (c) datasets, and (d) tasks.
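To make the two kinds of knowledge named in the abstract concrete, here is a minimal pure-Python sketch. It is not the MLKD objective, which this listing does not spell out. `alignment_loss` follows the abstract's description of traditional KD (KL divergence between softened softmax outputs, computed per sample). `correlation_loss` is a deliberately simpler stand-in for the contrastive objectives the abstract mentions: it matches pairwise cosine similarities across a batch. All function names, the temperature value, and the similarity-matching form are assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def alignment_loss(teacher_logits, student_logits, temperature=4.0):
    """KL(teacher || student) on softened outputs of ONE sample.

    Illustrates "knowledge alignment": only the individual sample matters.
    The temperature of 4.0 is an assumed, commonly used value.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def correlation_loss(teacher_feats, student_feats):
    """Mean squared gap between teacher and student pairwise similarities.

    Illustrates "knowledge correlation": the loss depends on relations
    BETWEEN samples in a batch. This is a hypothetical simplification,
    not the contrastive loss used by the methods the abstract refers to.
    """
    n = len(teacher_feats)
    total = 0.0
    for i in range(n):
        for j in range(n):
            t_sim = cosine(teacher_feats[i], teacher_feats[j])
            s_sim = cosine(student_feats[i], student_feats[j])
            total += (t_sim - s_sim) ** 2
    return total / (n * n)
```

Because `correlation_loss` compares similarity matrices rather than raw features, the teacher and student feature vectors may have different dimensions, which fits the model-agnostic setting described above. A training objective would combine the two terms in some way, but how MLKD weights or formulates them is not stated here.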
