SimReg: Regression as a Simple Yet Effective Tool for Self-supervised Knowledge Distillation
This work addresses the problem of efficiently distilling large self-supervised models into smaller ones that practitioners can deploy. The contribution is incremental in that it builds on existing regression-based distillation techniques.
The paper tackles knowledge distillation from self-supervised models by proposing simple architectural changes, such as adding a multi-layer perceptron head that is used only during distillation, and finds that this approach outperforms more complex state-of-the-art methods on the ImageNet dataset.
Feature regression is a simple way to distill large neural network models into smaller ones. We show that with simple changes to the network architecture, regression can outperform more complex state-of-the-art approaches for knowledge distillation from self-supervised models. Surprisingly, the addition of a multi-layer perceptron head to the CNN backbone is beneficial even if it is used only during distillation and discarded in the downstream task. Deeper non-linear projections can thus be used to accurately mimic the teacher without changing the inference architecture or inference time. Moreover, we utilize independent projection heads to simultaneously distill multiple teacher networks. We also find that using the same weakly augmented image as input for both teacher and student networks aids distillation. Experiments on the ImageNet dataset demonstrate the efficacy of the proposed changes in various self-supervised distillation settings.
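The regression objective described above can be illustrated with a minimal, forward-only sketch in plain Python. It is not the authors' implementation. The abstract does not state the exact loss, so the squared error between L2-normalized vectors used here is an assumption, as are the two-layer head, the feature dimensions, and every function name. The feature vectors stand in for backbone outputs that would be computed from the same weakly augmented image for student and teachers.

```python
import math
import random

def linear(x, weight, bias):
    # weight holds one row of input weights per output unit
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def l2_normalize(x, eps=1e-12):
    norm = math.sqrt(sum(v * v for v in x))
    return [v / (norm + eps) for v in x]

def init_linear(rng, n_in, n_out):
    scale = 1.0 / math.sqrt(n_in)
    weight = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out

def make_head(rng, n_in, n_hidden, n_out):
    # Hypothetical two-layer MLP head; it exists only during distillation.
    return [init_linear(rng, n_in, n_hidden), init_linear(rng, n_hidden, n_out)]

def head_forward(head, feat):
    (w1, b1), (w2, b2) = head
    return linear(relu(linear(feat, w1, b1)), w2, b2)

def regression_loss(pred, target):
    # Assumed loss: squared distance between L2-normalized vectors.
    p, t = l2_normalize(pred), l2_normalize(target)
    return sum((a - b) ** 2 for a, b in zip(p, t))

def distill_loss(student_feat, heads, teacher_feats):
    # One independent head per teacher; the per-teacher losses are summed.
    return sum(
        regression_loss(head_forward(head, student_feat), target)
        for head, target in zip(heads, teacher_feats)
    )

rng = random.Random(0)
# Stand-ins for backbone outputs on one (shared, weakly augmented) image.
student_feat = [rng.gauss(0, 1) for _ in range(8)]
teacher_feats = [
    [rng.gauss(0, 1) for _ in range(16)],  # teacher A, 16-d features
    [rng.gauss(0, 1) for _ in range(12)],  # teacher B, 12-d features
]
heads = [make_head(rng, 8, 32, 16), make_head(rng, 8, 32, 12)]
loss = distill_loss(student_feat, heads, teacher_feats)
print(f"distillation loss over {len(heads)} teachers: {loss:.4f}")
```

In a real training loop the loss would be minimized with respect to both the backbone and the heads. After distillation the heads are dropped and only the backbone output (`student_feat` in this sketch) is passed to the downstream task, so the inference architecture is unchanged.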