CV | Dec 28, 2021

Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention

Sitong Wu, Tianyi Wu, Haoru Tan, Guodong Guo

arXiv:2112.14000v117.285 citations | Has Code

Originality: Incremental advance

AI Analysis

This work addresses efficiency and accuracy limitations of vision transformer backbones and represents an incremental improvement over existing local attention methods.

The paper targets the insufficient context modeling that results when vision transformers restrict attention to local regions. It proposes Pale-Shaped self-Attention (PS-Attention), which needs significantly less computation and memory than global self-attention while capturing richer context than earlier local attention mechanisms of similar complexity. The resulting Pale Transformer reaches up to 84.9% Top-1 accuracy on ImageNet-1K and outperforms CSWin Transformer on ADE20K semantic segmentation and on COCO object detection and instance segmentation.

Recently, Transformers have shown promising performance in various vision tasks. To reduce the quadratic computation complexity of global self-attention, various methods constrain the range of attention to a local region to improve efficiency. Consequently, their receptive fields in a single attention layer are not large enough, resulting in insufficient context modeling. To address this issue, we propose a Pale-Shaped self-Attention (PS-Attention), which performs self-attention within a pale-shaped region. Compared to global self-attention, PS-Attention can reduce the computation and memory costs significantly. Meanwhile, it can capture richer contextual information at a computation complexity similar to that of previous local self-attention mechanisms. Based on PS-Attention, we develop a general Vision Transformer backbone with a hierarchical architecture, named Pale Transformer, which achieves 83.4%, 84.3%, and 84.9% Top-1 accuracy with model sizes of 22M, 48M, and 85M respectively for 224×224 ImageNet-1K classification, outperforming the previous Vision Transformer backbones. For downstream tasks, our Pale Transformer backbone performs better than the recent state-of-the-art CSWin Transformer by a large margin on ADE20K semantic segmentation and COCO object detection & instance segmentation. The code will be released at https://github.com/BR-IDL/PaddleViT.
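The abstract does not spell out what a pale-shaped region looks like, so the following is only a rough sketch under stated assumptions: it treats a pale as a set of evenly interlaced rows and columns of the token grid and counts how many keys a single query would attend to, compared with global self-attention. The function name `pale_positions`, its parameters, and the 56×56 grid with 7 rows and 7 columns per pale are hypothetical choices for illustration, not values taken from this page.

```python
def pale_positions(height, width, rows_per_pale, cols_per_pale, pale_id=0):
    """Return the (row, col) token positions covered by one pale.

    Assumption (not stated in the abstract): a pale consists of
    `rows_per_pale` evenly interlaced rows plus `cols_per_pale`
    evenly interlaced columns of a height x width token grid.
    """
    if height % rows_per_pale or width % cols_per_pale:
        raise ValueError("grid size must be divisible by rows/cols per pale")
    row_stride = height // rows_per_pale  # gap between rows of one pale
    col_stride = width // cols_per_pale   # gap between columns of one pale
    if not 0 <= pale_id < min(row_stride, col_stride):
        raise ValueError("pale_id out of range")
    rows = set(range(pale_id, height, row_stride))
    cols = set(range(pale_id, width, col_stride))
    # A token belongs to the pale if it lies on a selected row or column.
    return {(r, c) for r in range(height) for c in range(width)
            if r in rows or c in cols}


# Hypothetical 56 x 56 token grid with 7 rows and 7 columns per pale.
pale = pale_positions(56, 56, 7, 7)
global_keys = 56 * 56    # keys seen by one query under global attention
pale_keys = len(pale)    # keys seen by one query restricted to its pale
print(pale_keys, global_keys, pale_keys / global_keys)
```

Under these assumptions a query attends to 735 of the 3136 tokens, roughly a quarter of the global cost, yet the region still spans the full height and width of the grid, which is the intuition behind "richer context" than a small square window.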
