LG · Dec 22, 2023

Less or More From Teacher: Exploiting Trilateral Geometry For Knowledge Distillation

Chengming Hu, Haolun Wu, Xuan Li, Chen Ma, Xi Chen, Jun Yan, Boyu Wang, Xue Liu

arXiv:2312.15112v35.34 citations · h-index: 17 · ICLR

Originality: Incremental advance

AI Analysis

This work addresses a key bottleneck in knowledge distillation for practitioners by providing a more flexible and effective way to combine supervisory signals, though it is incremental in nature.

The paper tackles the challenge of balancing teacher and ground truth supervision in knowledge distillation by introducing an adaptive method that learns sample-wise fusion ratios based on trilateral geometric relations, achieving consistent improvements across tasks like image classification and attack detection.

Knowledge distillation aims to train a compact student network using soft supervision from a larger teacher network and hard supervision from ground truths. However, determining an optimal knowledge fusion ratio that balances these supervisory signals remains challenging. Prior methods generally resort to a constant or heuristic-based fusion ratio, which often falls short of a proper balance. In this study, we introduce a novel adaptive method for learning a sample-wise knowledge fusion ratio, exploiting both the correctness of teacher and student, as well as how well the student mimics the teacher on each sample. Our method naturally leads to the intra-sample trilateral geometric relations among the student prediction ($S$), teacher prediction ($T$), and ground truth ($G$). To counterbalance the impact of outliers, we further extend to the inter-sample relations, incorporating the teacher's global average prediction $\bar{T}$ for samples within the same class. A simple neural network then learns the implicit mapping from the intra- and inter-sample relations to an adaptive, sample-wise knowledge fusion ratio in a bilevel-optimization manner. Our approach provides a simple, practical, and adaptable solution for knowledge distillation that can be employed across various architectures and model sizes. Extensive experiments demonstrate consistent improvements over other loss re-weighting methods on image classification, attack detection, and click-through rate prediction.
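The sample-wise fusion described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the feature layout (edge vectors among $S$, $T$, $G$, and $\bar{T}$), the single sigmoid layer standing in for the "simple neural network", the function names, and the choice to let the ratio weight the ground-truth term and its complement weight the teacher term. The bilevel optimization that trains the ratio network is omitted.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    # Hard supervision: negative log-probability of the true class.
    return -math.log(max(probs[label], 1e-12))

def kl_divergence(p, q):
    # KL(p || q), used here as the soft (teacher) supervision term.
    return sum(pi * math.log(max(pi, 1e-12) / max(qi, 1e-12))
               for pi, qi in zip(p, q) if pi > 0)

def trilateral_features(s, t, g, t_bar):
    # Assumed feature layout: intra-sample edges S-G, S-T, T-G, plus
    # inter-sample edges S-T_bar and T_bar-G, concatenated into one vector.
    def diff(a, b):
        return [x - y for x, y in zip(a, b)]
    return (diff(s, g) + diff(s, t) + diff(t, g)
            + diff(s, t_bar) + diff(t_bar, g))

def fusion_ratio(features, weights, bias):
    # Hypothetical one-layer stand-in for the ratio network; the sigmoid
    # keeps the per-sample ratio in (0, 1).
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def fused_loss(student_logits, teacher_logits, label, t_bar, weights, bias):
    s = softmax(student_logits)
    t = softmax(teacher_logits)
    g = [1.0 if i == label else 0.0 for i in range(len(s))]
    alpha = fusion_ratio(trilateral_features(s, t, g, t_bar), weights, bias)
    # Assumed convention: alpha weights the ground-truth term,
    # (1 - alpha) weights the teacher term.
    loss = alpha * cross_entropy(s, label) + (1.0 - alpha) * kl_divergence(t, s)
    return loss, alpha

# Usage with three classes; t_bar is an illustrative class-average teacher output.
n_features = 5 * 3
loss, alpha = fused_loss([2.0, 0.5, -1.0], [1.5, 1.0, -0.5], 0,
                         [0.7, 0.2, 0.1], [0.0] * n_features, 0.0)
```

With all weights and the bias set to zero, the ratio is 0.5 for every sample, which corresponds to the constant-ratio baseline that the abstract contrasts against; a trained ratio network would instead vary the ratio per sample.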
