On Good Practices for Task-Specific Distillation of Large Pretrained Visual Models
This work addresses the need for efficient, specialized models in real-world applications, but its contribution is incremental, since it refines existing distillation methods.
The paper tackles the problem of distilling large pretrained visual models into compact task-specific ones. It shows that existing practices are suboptimal, proposes new guidelines, and introduces a Mixup variant that improves distillation without requiring engineered text prompts.
Large pretrained visual models exhibit remarkable generalization across diverse recognition tasks. Yet real-world applications often demand compact models tailored to specific problems. Variants of knowledge distillation have been devised for this purpose, enabling task-specific compact models (the students) to learn from a generic large pretrained one (the teacher). In this paper, we show that the excellent robustness and versatility of recent pretrained models challenge common practices established in the literature, calling for a new set of optimal guidelines for task-specific distillation. To address the scarcity of samples in downstream tasks, we also show that a variant of Mixup based on Stable Diffusion complements standard data augmentation. This strategy eliminates the need for engineered text prompts and improves the distillation of generic models into streamlined specialized networks.
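The abstract combines two ingredients: a student trained to match a teacher's outputs, and Mixup to enlarge scarce downstream data. The following is a minimal sketch in plain Python, assuming the common temperature-scaled KL-divergence distillation objective; the exact loss used in the paper is not stated here. It also uses ordinary pixel-space Mixup as a stand-in: the paper's variant generates mixed images with Stable Diffusion, which this sketch does not reproduce. The functions `toy_teacher` and `toy_student` are hypothetical placeholders for real networks.

```python
import math
import random
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax, shifted by the max logit for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits: Sequence[float],
                      teacher_logits: Sequence[float],
                      temperature: float = 4.0) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumption: the classic temperature-scaled formulation, not necessarily
    the paper's exact objective.
    """
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(pt * (math.log(pt) - math.log(ps))
             for pt, ps in zip(p_t, p_s) if pt > 0)
    return (temperature ** 2) * kl


def mixup(x_a: Sequence[float], x_b: Sequence[float], alpha: float = 1.0,
          rng: random.Random = None) -> Tuple[List[float], float]:
    """Standard pixel-space Mixup: convex blend with lambda ~ Beta(alpha, alpha).

    Stand-in only: the paper's variant mixes images via Stable Diffusion.
    """
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)
    mixed = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    return mixed, lam


# Hypothetical stand-ins for a large pretrained teacher and a compact student.
def toy_teacher(x: Sequence[float]) -> List[float]:
    return [2.0 * x[0], 2.0 * x[1], 0.0]


def toy_student(x: Sequence[float]) -> List[float]:
    return [x[0], x[1], 0.5]


if __name__ == "__main__":
    rng = random.Random(0)
    mixed, lam = mixup([1.0, 0.0], [0.0, 1.0], alpha=1.0, rng=rng)
    # The teacher scores the mixed input itself, so the student's target
    # needs no interpolated ground-truth label.
    loss = distillation_loss(toy_student(mixed), toy_teacher(mixed))
    print(f"lambda={lam:.3f}, distillation loss={loss:.4f}")
```

In this sketch the teacher provides the target for every augmented input, which is why any augmentation (pixel-space here, diffusion-based in the paper) can be plugged in without deciding how to mix labels.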