Co-advise: Cross Inductive Bias Distillation
This work addresses a practical limitation for researchers and practitioners using vision transformers in data-scarce scenarios, representing an incremental improvement over existing distillation methods.
The paper tackles the problem of vision transformers underperforming when training data is limited. It proposes a distillation method in which multiple lightweight teachers with different architectural inductive biases co-advise the student transformer. The resulting models, termed CivT, outperform all previous transformers of the same architecture on ImageNet.
Transformers have recently been adapted from the natural language processing community as a promising substitute for convolution-based neural networks in visual learning tasks. However, their supremacy degenerates given an insufficient amount of training data (e.g., ImageNet). To make them practically useful, we propose a novel distillation-based method to train vision transformers. Unlike previous works, where only heavy convolution-based teachers are provided, we introduce lightweight teachers with different architectural inductive biases (e.g., convolution and involution) to co-advise the student transformer. The key is that teachers with different inductive biases attain different knowledge even though they are trained on the same dataset, and this different knowledge compounds and boosts the student's performance during distillation. Equipped with this cross inductive bias distillation method, our vision transformers (termed CivT) outperform all previous transformers of the same architecture on ImageNet.
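To make the idea of several teachers co-advising one student concrete, the following is a minimal sketch of a multi-teacher distillation objective. It is not the paper's exact formulation: the abstract does not specify the loss, so the sketch assumes a standard soft-label setup in which the student's cross-entropy on the ground-truth label is combined with the average KL divergence from each teacher's predictions. The names `co_advise_loss`, `teacher_weight`, and `temperature` are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, label):
    """Cross-entropy between the student's prediction and the true label."""
    return -math.log(softmax(student_logits)[label])


def kl_divergence(teacher_probs, student_probs):
    """KL(teacher || student) for two discrete distributions."""
    return sum(
        t * math.log(t / s)
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )


def co_advise_loss(student_logits, teacher_logits_list, label,
                   temperature=1.0, teacher_weight=0.5):
    """Assumed objective: label loss plus averaged distillation loss.

    teacher_logits_list holds one logit vector per teacher, for example
    one from a convolution-based teacher and one from an involution-based
    teacher. Equal weighting of teachers is an assumption of this sketch.
    """
    label_loss = cross_entropy(student_logits, label)
    student_probs = softmax(student_logits, temperature)
    distill_loss = 0.0
    for teacher_logits in teacher_logits_list:
        teacher_probs = softmax(teacher_logits, temperature)
        distill_loss += kl_divergence(teacher_probs, student_probs)
    distill_loss /= len(teacher_logits_list)
    return (1 - teacher_weight) * label_loss + teacher_weight * distill_loss


# Usage: two teachers that disagree with each other on a 3-class problem.
conv_teacher = [2.0, 0.5, 0.1]
inv_teacher = [1.0, 1.5, 0.2]
student = [0.3, 0.2, 0.1]
loss = co_advise_loss(student, [conv_teacher, inv_teacher], label=0)
```

Under this assumed objective, teachers that make different predictions each contribute a distinct distillation term, which is one simple way the "different knowledge" described above could reach the student.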