LG, CV, ML | Feb 21, 2020

Residual Knowledge Distillation

arXiv:2002.09168v1 | 32 citations
AI Analysis

This paper addresses model compression for efficient deployment, but the contribution is incremental: it builds on existing knowledge distillation methods.

The paper tackles performance degradation in knowledge distillation due to capacity gaps between teacher and student models by proposing Residual Knowledge Distillation (RKD), which uses an assistant to learn residual errors, achieving state-of-the-art results on CIFAR-100 and ImageNet.

Knowledge distillation (KD) is one of the most potent approaches to model compression. The key idea is to transfer knowledge from a deep teacher model (T) to a shallower student (S). However, existing methods suffer from performance degradation because of the substantial gap between the learning capacities of S and T. To remedy this problem, this work proposes Residual Knowledge Distillation (RKD), which further distills the knowledge by introducing an assistant (A). Specifically, S is trained to mimic the feature maps of T, and A aids this process by learning the residual error between them. In this way, S and A complement each other to obtain better knowledge from T. Furthermore, we devise an effective method to derive S and A from a given model without increasing the total computational cost. Extensive experiments show that our approach achieves appealing results on two popular classification datasets, CIFAR-100 and ImageNet, surpassing state-of-the-art methods.
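The residual idea in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: it assumes flattened feature vectors, a mean-squared-error mimicking loss, and made-up numbers, and the helper names (`mse`, `residual_target`, `combine`) are hypothetical. It only shows that if S mimics T and A fits the remaining error T - S, then the error of S + A against T equals A's error on the residual.

```python
def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def residual_target(f_teacher, f_student):
    """Residual the assistant is asked to fit: teacher minus student."""
    return [t - s for t, s in zip(f_teacher, f_student)]


def combine(f_student, f_assistant):
    """Student features corrected by the assistant's residual estimate."""
    return [s + a for s, a in zip(f_student, f_assistant)]


# Toy flattened feature maps (illustrative values, not from the paper).
f_t = [1.0, 2.0, 3.0, 4.0]      # teacher T
f_s = [0.5, 2.5, 2.0, 4.0]      # student S, an imperfect mimic of T
f_a = [0.5, -0.5, 0.5, 0.0]     # assistant A, an imperfect residual estimate

residual = residual_target(f_t, f_s)          # what A should learn
student_loss = mse(f_s, f_t)                  # assumed mimicking loss for S
assistant_loss = mse(f_a, residual)           # assumed residual loss for A
combined_loss = mse(combine(f_s, f_a), f_t)   # error of S + A against T

print(student_loss, assistant_loss, combined_loss)
```

In this toy setting the combined features are closer to the teacher than the student alone, which is the sense in which S and A complement each other; how the paper actually derives S and A from a given model is not covered by this sketch.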

