Embedded Knowledge Distillation in Depth-Level Dynamic Neural Network
This work addresses the need for efficient, high-accuracy neural networks tailored to devices with varying resources, though the contribution is incremental: it builds on existing dynamic-network and knowledge-distillation methods.
The paper tackles the problem of training multiple depth-level sub-networks for different computational devices by proposing a Depth-Level Dynamic Neural Network (DDNN) with Embedded Knowledge Distillation (EKD), resulting in sub-nets that achieve better performance than individually trained networks on datasets like CIFAR-10/100 and ImageNet while preserving full-net accuracy.
In real applications, devices with different computational resources need networks of different depths (e.g., ResNet-18/34/50) with high accuracy. Existing methods usually either design multiple networks and train them independently, or construct depth-level or width-level dynamic neural networks, for which the accuracy of each sub-net is hard to guarantee. In this article, we propose an elegant Depth-Level Dynamic Neural Network (DDNN) that integrates different-depth sub-nets of similar architectures. To improve the generalization of the sub-nets, we design the Embedded-Knowledge-Distillation (EKD) training mechanism for the DDNN, which transfers knowledge from the teacher (full-net) to multiple students (sub-nets). Specifically, the Kullback-Leibler (KL) divergence is introduced to constrain the consistency of the posterior class probabilities between the full-net and the sub-nets, and self-attention distillation between same-resolution features at different depths is applied to drive richer feature representations in the sub-nets. Thus, we can obtain multiple high-accuracy sub-nets simultaneously in a DDNN via online knowledge distillation in each training iteration, without extra computation cost. Extensive experiments on the CIFAR-10, CIFAR-100, and ImageNet datasets demonstrate that sub-nets in a DDNN with EKD training achieve better performance than individually trained networks while preserving the original performance of the full-net.
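The two distillation terms described above can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation. The function names, the temperature, the loss weights, the cross-entropy term on each sub-net, and the particular attention map (channel-wise sum of squared activations, L2-normalized) are all assumptions made for illustration; the abstract only states that a KL term aligns class posteriors and that an attention-based term aligns same-resolution features.

```python
import math

def softmax(logits, temperature=1.0):
    # Numerically stable softmax; temperature is an assumed hyperparameter.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    # Standard classification loss for a single example.
    return -math.log(softmax(logits)[label])

def kl_divergence(teacher_logits, student_logits, temperature=1.0):
    # KL(teacher || student) between the two class posteriors.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def attention_map(feature):
    # feature: list of channels, each a flat list of spatial activations.
    # Assumed attention definition: sum of squares over channels, L2-normalized.
    n = len(feature[0])
    energy = [sum(ch[i] ** 2 for ch in feature) for i in range(n)]
    norm = math.sqrt(sum(e * e for e in energy)) or 1.0
    return [e / norm for e in energy]

def attention_loss(teacher_feature, student_feature):
    # Squared L2 distance between attention maps of equal spatial size.
    a_t = attention_map(teacher_feature)
    a_s = attention_map(student_feature)
    return sum((t - s) ** 2 for t, s in zip(a_t, a_s))

def ekd_loss(full_logits, sub_logits_list, label, full_feat, sub_feats,
             kl_weight=1.0, att_weight=1.0, temperature=1.0):
    # Hypothetical combined objective: the full-net is trained on the label,
    # and each sub-net is trained on the label plus the two distillation terms.
    # In an autograd framework the teacher outputs would typically be detached.
    loss = cross_entropy(full_logits, label)
    for sub_logits, sub_feat in zip(sub_logits_list, sub_feats):
        loss += cross_entropy(sub_logits, label)
        loss += kl_weight * kl_divergence(full_logits, sub_logits, temperature)
        loss += att_weight * attention_loss(full_feat, sub_feat)
    return loss
```

Because the teacher's logits and features come from the same forward pass as the sub-nets', a loss of this shape needs no separately trained teacher, which is the sense in which the distillation is "embedded" and online.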