MoKD: Multi-Task Optimization for Knowledge Distillation
This work addresses efficiency and performance issues in training compact models for computer vision tasks, though its contribution is incremental because it builds on existing knowledge distillation methods.
The paper tackles two challenges in Knowledge Distillation (KD): balancing learning from the teacher's guidance against the task objective, and handling the disparity in knowledge representation between teacher and student models. It proposes MoKD, which reformulates KD as a multi-objective optimization problem and introduces subspace learning, and it reports state-of-the-art performance on the ImageNet-1K and COCO datasets.
Compact models can be effectively trained through Knowledge Distillation (KD), a technique that transfers knowledge from larger, high-performing teacher models. Two key challenges in KD are: 1) balancing learning from the teacher's guidance and the task objective, and 2) handling the disparity in knowledge representation between teacher and student models. To address these, we propose Multi-Task Optimization for Knowledge Distillation (MoKD). MoKD tackles two main gradient issues: a) Gradient Conflicts, where task-specific and distillation gradients are misaligned, and b) Gradient Dominance, where one objective's gradient dominates, causing imbalance. MoKD reformulates KD as a multi-objective optimization problem, enabling better balance between objectives. Additionally, it introduces a subspace learning framework to project feature representations into a high-dimensional space, improving knowledge transfer. Extensive experiments on image classification with the ImageNet-1K dataset and object detection with the COCO dataset show that MoKD outperforms existing methods, achieving state-of-the-art performance with greater efficiency. To the best of our knowledge, MoKD models also achieve state-of-the-art performance compared to models trained from scratch.
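The two gradient issues named in the abstract can be made concrete with a small sketch. The abstract does not give MoKD's actual update rule, so the code below is only an illustration under stated assumptions: gradients are flattened into plain lists, a conflict is taken to mean negative cosine similarity, dominance is taken to mean a norm ratio above a hypothetical threshold, and the two gradients are balanced with the standard closed-form min-norm weighting for two objectives (as used in MGDA-style multi-objective optimization), not necessarily the paper's own solver. All function names and the `dominance_ratio` parameter are invented for this example.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def diagnose(g_task, g_kd, dominance_ratio=10.0):
    """Return (conflict, dominance) flags for two flattened gradients.

    Assumption: conflict = negative cosine similarity;
    dominance = one norm exceeds the other by `dominance_ratio`.
    """
    n_task, n_kd = norm(g_task), norm(g_kd)
    if n_task == 0.0 or n_kd == 0.0:
        return False, False  # a zero gradient cannot conflict or be compared
    conflict = dot(g_task, g_kd) / (n_task * n_kd) < 0.0
    dominance = max(n_task, n_kd) / min(n_task, n_kd) > dominance_ratio
    return conflict, dominance


def min_norm_weight(g_task, g_kd):
    """Alpha in [0, 1] minimising ||alpha * g_task + (1 - alpha) * g_kd||."""
    diff = [a - b for a, b in zip(g_task, g_kd)]
    denom = dot(diff, diff)
    if denom == 0.0:
        return 0.5  # identical gradients: any weighting gives the same result
    alpha = -dot(diff, g_kd) / denom
    return min(1.0, max(0.0, alpha))


def combine(g_task, g_kd):
    """Weighted update direction using the min-norm weight."""
    alpha = min_norm_weight(g_task, g_kd)
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(g_task, g_kd)]


# Conflicting gradients: the combined direction still has a positive
# inner product with both objectives' gradients.
g_task, g_kd = [1.0, 0.0], [-0.5, 1.0]
print(diagnose(g_task, g_kd))
print(combine(g_task, g_kd))

# Dominating gradient: the min-norm weight shrinks the larger one.
print(diagnose([100.0, 0.0], [0.0, 1.0]))
print(min_norm_weight([100.0, 0.0], [0.0, 1.0]))
```

In this sketch a fixed-weight sum of the two losses would follow whichever gradient is larger, whereas the min-norm weighting downweights the dominating gradient and, when the weight lies strictly between 0 and 1, yields a direction that does not oppose either objective.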