ViT-P: Rethinking Data-efficient Vision Transformers from Locality
This work addresses the data inefficiency of vision transformers for computer vision tasks, making them competitive with convolutional neural networks on small datasets.
The paper tackles the problem of vision transformers requiring large datasets to train effectively. It introduces a multi-focal attention bias that constrains self-attention to multi-scale localized receptive fields, achieving a state-of-the-art accuracy of 83.16% on CIFAR-100 when trained from scratch.
Recent advances in Transformers have brought new momentum to computer vision tasks. However, on small datasets, Transformers are hard to train and perform worse than convolutional neural networks. We make vision transformers as data-efficient as convolutional neural networks by introducing multi-focal attention bias. Inspired by the attention distance in a well-trained ViT, we constrain the self-attention of ViT to have a multi-scale localized receptive field. The size of the receptive field is adaptable during training so that the optimal configuration can be learned. We provide empirical evidence that properly constraining the receptive field can reduce the amount of training data required by vision transformers. On CIFAR-100, our ViT-P Base model achieves state-of-the-art accuracy (83.16%) when trained from scratch. We also perform analysis on ImageNet to show that our method does not lose accuracy on large datasets.
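The abstract does not give the exact form of the multi-focal attention bias, so the following is only a minimal sketch of the general idea, under stated assumptions: each attention head receives an additive bias that penalizes attention to distant patches, and each head uses a different scale so that the heads together cover several receptive-field sizes. The Gaussian-style penalty, the function names, and the `sigma` values are all hypothetical choices made for illustration. In a real model, `sigma` would be a trainable parameter so that the receptive-field size can adapt during training.

```python
import math

def patch_distances(grid_size):
    """Pairwise Euclidean distances between patches on a grid_size x grid_size grid."""
    coords = [(r, c) for r in range(grid_size) for c in range(grid_size)]
    return [[math.hypot(r1 - r2, c1 - c2) for (r2, c2) in coords]
            for (r1, c1) in coords]

def locality_bias(dist, sigma):
    """Hypothetical bias (an assumption, not the paper's formula):
    0 for a patch attending to itself, increasingly negative with distance.
    A smaller sigma gives a tighter receptive field."""
    return [[-(d / sigma) ** 2 for d in row] for row in dist]

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def biased_attention(logits, bias):
    """Add the bias to the attention logits, then normalize each row."""
    return [softmax([l + b for l, b in zip(lrow, brow)])
            for lrow, brow in zip(logits, bias)]

def mean_attention_distance(attn, dist):
    """Attention-weighted distance, averaged over all query patches."""
    n = len(attn)
    return sum(a * d for arow, drow in zip(attn, dist)
               for a, d in zip(arow, drow)) / n

if __name__ == "__main__":
    grid = 4                       # 4 x 4 = 16 patches
    dist = patch_distances(grid)
    n = grid * grid
    # Uniform logits isolate the effect of the bias from any content term.
    logits = [[0.0] * n for _ in range(n)]
    # One scale per head: the "multi-scale" part of the sketch.
    for sigma in (0.5, 1.0, 2.0, 4.0):
        attn = biased_attention(logits, locality_bias(dist, sigma))
        print(f"sigma={sigma}: mean attention distance = "
              f"{mean_attention_distance(attn, dist):.3f}")
```

With uniform logits, heads with a smaller `sigma` concentrate their attention on nearby patches, and heads with a larger `sigma` attend more broadly, which is the kind of multi-scale locality the abstract describes.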