LG, CV | Feb 16, 2022

Meta Knowledge Distillation

arXiv:2202.07940v1 | 35 citations
Originality: Highly original
AI Analysis

This work addresses a key bottleneck in applying knowledge distillation to state-of-the-art models trained with strong data augmentations, and reports gains in both training efficiency and accuracy on computer vision tasks.

The paper tackles two degradation problems in knowledge distillation, the teacher-student gap and the incompatibility with strong data augmentations, by proposing Meta Knowledge Distillation (MKD), which meta-learns the softmax temperature parameters. With ViT-L, MKD reaches 86.5% accuracy on ImageNet-1K after 600 epochs of training, 0.6 percentage points better than MAE trained for 1,650 epochs, the best result among the compared methods that use only ImageNet-1K as training data.

Recent studies pointed out that knowledge distillation (KD) suffers from two degradation problems, the teacher-student gap and the incompatibility with strong data augmentations, making it inapplicable to training state-of-the-art models, which are trained with advanced augmentations. However, we observe that a key factor, i.e., the temperatures in the softmax functions for generating probabilities of both the teacher and student models, was mostly overlooked in previous methods. With properly tuned temperatures, such degradation problems of KD can be largely mitigated. Instead of relying on a naive grid search, which shows poor transferability, we propose Meta Knowledge Distillation (MKD) to meta-learn the distillation with learnable meta temperature parameters. The meta parameters are adaptively adjusted during training according to the gradients of the learning objective. We validate that MKD is robust to different dataset scales, different teacher/student architectures, and different types of data augmentation. With MKD, we achieve the best performance with popular ViT architectures among compared methods that use only ImageNet-1K as training data, ranging from tiny to large models. With ViT-L, we achieve 86.5% with 600 epochs of training, 0.6% better than MAE that trains for 1,650 epochs.
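
To make the role of the two temperatures concrete, the following is a minimal sketch of a distillation loss in which the teacher and student softmax functions each take their own temperature. It is not the authors' implementation: the function names (`softmax`, `kd_loss`), the parameter names (`t_teacher`, `t_student`), the choice of KL divergence as the loss, and the example logits are all assumptions made for illustration. The sketch only evaluates the loss at fixed temperatures; in MKD the temperatures would instead be learnable parameters updated by gradients during training, which is not shown here.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, t_teacher=1.0, t_student=1.0):
    """KL(teacher || student), with a separate (hypothetical) temperature per model."""
    p = softmax(teacher_logits, t_teacher)
    q = softmax(student_logits, t_student)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Illustrative logits only; not taken from the paper.
teacher_logits = [6.0, 2.0, 1.0]
student_logits = [3.0, 2.5, 0.5]

# Show how the loss the student sees depends on the teacher temperature.
for t in (1.0, 2.0, 4.0):
    print(t, round(kd_loss(teacher_logits, student_logits, t_teacher=t), 4))
```

Because the loss value changes with each temperature, a temperature tuned by grid search for one teacher/student pair need not suit another, which is the motivation the abstract gives for learning the temperatures instead.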

