SimViT: Exploring a Simple Vision Transformer with sliding windows
This work addresses the need for efficient and accurate vision Transformers in image processing tasks. It is an incremental improvement that modifies the attention mechanism and adds sliding windows.
The paper tackles the problem of vision Transformers disrupting spatial and local correlations in images by introducing SimViT, a simple vision Transformer that incorporates spatial structure and local information, achieving 71.1% top-1 accuracy on ImageNet-1k with only 3.3M parameters.
Although vision Transformers have achieved excellent performance as backbone models in many vision tasks, most of them aim to capture global relations among all tokens in an image or a window, which disrupts the inherent spatial and local correlations between patches in the 2D structure. In this paper, we introduce a simple vision Transformer named SimViT to incorporate spatial structure and local information into vision Transformers. Specifically, we introduce Multi-head Central Self-Attention (MCSA) in place of conventional Multi-head Self-Attention to capture highly local relations. The introduction of sliding windows facilitates the capture of spatial structure. Meanwhile, SimViT extracts multi-scale hierarchical features from different layers for dense prediction tasks. Extensive experiments show that SimViT is effective and efficient as a general-purpose backbone model for various image processing tasks. In particular, our SimViT-Micro needs only 3.3M parameters to achieve 71.1% top-1 accuracy on the ImageNet-1k dataset, making it the smallest vision Transformer model to date. Our code will be available at https://github.com/ucasligang/SimViT.
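
To make the idea of local attention over sliding windows concrete, the following is a minimal single-head sketch and not the authors' implementation. It assumes that the token at the centre of each window acts as the query and that the tokens inside the window serve as keys and values. The function name `central_window_attention`, the identity Q/K/V projections, the default 3x3 window, and the choice to skip out-of-bounds neighbours are all illustrative assumptions that the abstract does not specify.

```python
import math


def central_window_attention(grid, window=3):
    """Hypothetical single-head sketch of attention restricted to a sliding window.

    grid: H x W nested list of d-dimensional feature vectors (lists of floats).
    Assumptions (not taken from the paper): identity Q/K/V projections,
    one head, and neighbours outside the image are skipped, not zero-padded.
    """
    h, w = len(grid), len(grid[0])
    d = len(grid[0][0])
    r = window // 2
    scale = 1.0 / math.sqrt(d)
    out = [[None] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            q = grid[i][j]  # the central token of the window is the query
            # Keys and values: every token inside the window centred on (i, j).
            neigh = [
                grid[y][x]
                for y in range(max(0, i - r), min(h, i + r + 1))
                for x in range(max(0, j - r), min(w, j + r + 1))
            ]
            # Scaled dot-product scores between the centre and its neighbours.
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in neigh]
            # Numerically stable softmax over the window only.
            m = max(scores)
            exps = [math.exp(s - m) for s in scores]
            z = sum(exps)
            weights = [e / z for e in exps]
            # Weighted sum of the neighbouring value vectors.
            out[i][j] = [
                sum(wt * v[c] for wt, v in zip(weights, neigh)) for c in range(d)
            ]
    return out
```

In this sketch each output token depends on at most `window * window` neighbours, so the cost grows linearly with the number of tokens rather than quadratically as in global self-attention. The output keeps the same H x W layout as the input, which is what preserves the 2D spatial structure.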