Generic-to-Specific Distillation of Masked Autoencoders
This work addresses the challenge of improving generalization in small vision models, though it is incremental in that it builds on existing distillation paradigms.
The paper tackles the problem of lightweight vision Transformers benefiting little from self-supervised pre-training by proposing a two-stage distillation method (G2SD) that transfers both task-agnostic and task-specific knowledge from large to small models, achieving 98.7%, 98.1%, and 99.3% of teacher performance on image classification, object detection, and semantic segmentation.
Large vision Transformers (ViTs) driven by self-supervised pre-training mechanisms have achieved unprecedented progress. Lightweight ViT models, limited by their capacity, however, benefit little from those pre-training mechanisms. Knowledge distillation defines a paradigm for transferring representations from large (teacher) models to small (student) ones. However, conventional single-stage distillation easily gets stuck on task-specific transfer, failing to retain the task-agnostic knowledge crucial for model generalization. In this study, we propose generic-to-specific distillation (G2SD) to tap the potential of small ViT models under the supervision of large models pre-trained by masked autoencoders. In generic distillation, the decoder of the small model is encouraged to align its feature predictions with the hidden representations of the large model, so that task-agnostic knowledge can be transferred. In specific distillation, the predictions of the small model are constrained to be consistent with those of the large model, transferring the task-specific features that guarantee task performance. With G2SD, the vanilla ViT-Small model achieves 98.7%, 98.1%, and 99.3% of the performance of its teacher (ViT-Base) on image classification, object detection, and semantic segmentation, respectively, setting a solid baseline for two-stage vision distillation. Code will be available at https://github.com/pengzhiliang/G2SD.
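The two stages described above can be illustrated with a minimal sketch of the two kinds of objective. The abstract does not specify the loss functions, so the choices here are assumptions made only for illustration: mean squared error for aligning the student decoder's feature predictions with the teacher's hidden representations, and a KL divergence between softened class distributions for prediction consistency. The function names and the `temperature` parameter are hypothetical and are not taken from the G2SD code.

```python
import math


def generic_distill_loss(student_pred, teacher_hidden):
    """Stage 1 (assumed form): mean squared error between the student
    decoder's feature predictions and the teacher's hidden features.
    Both arguments are lists of equal-length feature vectors, one per token."""
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student_pred, teacher_hidden):
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    return total / count


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]


def specific_distill_loss(student_logits, teacher_logits, temperature=1.0):
    """Stage 2 (assumed form): KL(teacher || student) between the softened
    class distributions of teacher and student."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

In this sketch the student would first be trained with only the generic loss, and then fine-tuned on the downstream task with the specific loss, typically alongside the ordinary task loss. Both losses are zero when the student matches the teacher exactly.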