Expandable Residual Approximation for Knowledge Distillation
This addresses the capacity gap problem in knowledge distillation for computer vision, offering incremental improvements in model efficiency and performance.
The paper tackles the challenge of knowledge distillation by proposing Expandable Residual Approximation (ERA), a method that improves Top-1 accuracy on ImageNet by 1.41% and AP on MS COCO by 1.40.
Knowledge distillation (KD) aims to transfer knowledge from a large-scale teacher model to a lightweight one, significantly reducing computational and storage requirements. However, the inherent learning capacity gap between the teacher and student often hinders the sufficient transfer of knowledge, motivating numerous studies to address this challenge. Inspired by the progressive approximation principle in the Stone-Weierstrass theorem, we propose Expandable Residual Approximation (ERA), a novel KD method that decomposes the approximation of residual knowledge into multiple steps, reducing the difficulty of mimicking the teacher's representation through a divide-and-conquer approach. Specifically, ERA employs a Multi-Branched Residual Network (MBRNet) to implement this residual knowledge decomposition. Additionally, a Teacher Weight Integration (TWI) strategy is introduced to mitigate the capacity disparity by reusing the teacher's head weights. Extensive experiments show that ERA improves the Top-1 accuracy on the ImageNet classification benchmark by 1.41% and the AP on the MS COCO object detection benchmark by 1.40, as well as achieving leading performance across computer vision tasks. Codes and models are available at https://github.com/Zhaoyi-Yan/ERA.