Vision Transformer for Small-Size Datasets
This work addresses the challenge of applying ViTs to small datasets. The contribution is incremental, since it builds on existing ViT architectures with add-on modules.
The paper tackles the problem that Vision Transformers (ViTs) require large datasets for pre-training. It proposes Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA) to address the low locality inductive bias, which enables training from scratch on small datasets and improves performance by an average of 2.96% on Tiny-ImageNet.
Recently, the Vision Transformer (ViT), which applies the transformer structure to the image classification task, has outperformed convolutional neural networks. However, the high performance of the ViT results from pre-training on a large dataset such as JFT-300M, and its dependence on a large dataset is attributed to its low locality inductive bias. This paper proposes Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA), which effectively address the lack of locality inductive bias and enable ViTs to learn from scratch even on small datasets. Moreover, SPT and LSA are generic and effective add-on modules that are easily applicable to various ViTs. Experimental results show that when both SPT and LSA were applied to the ViTs, performance improved by an average of 2.96% on Tiny-ImageNet, a representative small dataset. In particular, Swin Transformer achieved a notably large performance improvement of 4.08% with the proposed SPT and LSA.
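The abstract names SPT and LSA but does not spell out how they work. The following is a minimal pure-Python sketch of one plausible reading, under assumptions that are not stated in the text above: SPT is assumed to concatenate the image with four diagonally shifted copies (shift of half a patch, zero-padded) before cutting patches, and LSA is assumed to be softmax attention with the diagonal (self-token) scores masked out and a temperature that would be learnable in a trained model. The function names, the zero padding, and the single-channel image are illustrative choices, not the authors' implementation.

```python
import math


def shift_image(img, dy, dx):
    """Shift a 2D image (list of rows) by (dy, dx); exposed borders become 0."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = img[sy][sx]
    return out


def shifted_patch_tokens(img, patch):
    """Assumed SPT: stack the image with four diagonal shifts, then flatten patches.

    Height and width must be multiples of `patch`. A real model would apply
    normalization and a learned linear projection to each token afterwards.
    """
    s = patch // 2  # assumption: shift by half the patch size
    views = [img] + [shift_image(img, dy, dx) for dy in (-s, s) for dx in (-s, s)]
    h, w = len(img), len(img[0])
    tokens = []
    for py in range(0, h, patch):
        for px in range(0, w, patch):
            tokens.append([v[y][x]
                           for v in views
                           for y in range(py, py + patch)
                           for x in range(px, px + patch)])
    return tokens


def locality_attention_weights(q, k, temperature):
    """Assumed LSA: row-wise softmax of (q . k) / temperature, diagonal masked.

    Requires at least two tokens, since each token ignores itself.
    `temperature` is a plain number here; it would be a learned parameter.
    """
    n = len(q)
    weights = []
    for i in range(n):
        scores = [-math.inf if i == j
                  else sum(a * b for a, b in zip(q[i], k[j])) / temperature
                  for j in range(n)]
        m = max(scores)
        exps = [math.exp(sc - m) for sc in scores]  # exp(-inf) is 0.0
        z = sum(exps)
        weights.append([e / z for e in exps])
    return weights
```

With a 4x4 image and a patch size of 2, `shifted_patch_tokens` yields 4 tokens of length 20 (5 views times 4 pixels), so each token carries information from neighbouring pixels outside its own patch. In `locality_attention_weights`, a temperature below the usual square root of the key dimension sharpens the attention distribution, and the masked diagonal forces each token to attend to other tokens.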