CV · Apr 7, 2023

PSLT: A Light-weight Vision Transformer with Ladder Self-Attention and Progressive Shift

Gaojie Wu, Wei-Shi Zheng, Yutong Lu, Qi Tian

arXiv:2304.03481v15.927 citations · h-index: 68 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the high resource cost of Vision Transformers on computer vision tasks by offering a more efficient alternative. The advance is incremental because it builds on existing ViT methods.

The paper tackles the high computational cost of Vision Transformers by proposing PSLT, a light-weight transformer with ladder self-attention and progressive shift, achieving a top-1 accuracy of 79.9% on ImageNet-1k with 9.2M parameters and 1.9G FLOPs, comparable to models with over 20M parameters and 4G FLOPs.

Vision Transformer (ViT) has shown great potential for various visual tasks due to its ability to model long-range dependencies. However, ViT requires a large amount of computing resources to compute global self-attention. In this work, we propose a ladder self-attention block with multiple branches and a progressive shift mechanism to develop a light-weight transformer backbone that requires fewer computing resources (e.g., a relatively small number of parameters and FLOPs), termed Progressive Shift Ladder Transformer (PSLT). First, the ladder self-attention block reduces the computational cost by modelling local self-attention in each branch. Meanwhile, the progressive shift mechanism enlarges the receptive field of the ladder self-attention block by modelling diverse local self-attention in each branch and enabling interaction among the branches. Second, the input feature of the ladder self-attention block is split equally along the channel dimension across the branches, which considerably reduces the computational cost of the block (to nearly 1/3 of the parameters and FLOPs), and the outputs of the branches are then combined by pixel-adaptive fusion. Therefore, the ladder self-attention block is capable of modelling long-range interactions with a relatively small number of parameters and FLOPs. Based on the ladder self-attention block, PSLT performs well on several vision tasks, including image classification, object detection and person re-identification. On the ImageNet-1k dataset, PSLT achieves a top-1 accuracy of 79.9% with 9.2M parameters and 1.9G FLOPs, which is comparable to several existing models with more than 20M parameters and 4G FLOPs. Code is available at https://isee-ai.cn/wugaojie/PSLT.html.
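The "nearly 1/3" saving from the equal channel split can be checked with a rough count. The sketch below is not the authors' code: it assumes three branches (inferred from the 1/3 figure), a hypothetical width of 384 channels, 14x14 tokens and 7x7 local windows, and it counts only the Q, K, V and output projections plus the two attention matrix products. Biases, normalisation, the progressive shift and the pixel-adaptive fusion are ignored.

```python
def projection_params(channels: int, branches: int = 1) -> int:
    """Weights in the Q, K, V and output projections of one attention block.

    With branches > 1 the channels are split equally and each branch owns
    four square matrices of size (channels / branches). Biases are ignored.
    """
    if channels % branches != 0:
        raise ValueError("channels must be divisible by branches")
    per_branch = channels // branches
    return branches * 4 * per_branch * per_branch


def window_attention_macs(tokens: int, window: int, channels: int,
                          branches: int = 1) -> int:
    """Multiply-accumulates for local-window attention over all branches.

    Counts the four projections plus QK^T and attention-times-V, where
    each token attends to `window` tokens inside its own window.
    """
    if channels % branches != 0:
        raise ValueError("channels must be divisible by branches")
    per_branch = channels // branches
    proj = 4 * tokens * per_branch * per_branch
    attn = 2 * tokens * window * per_branch
    return branches * (proj + attn)


# Hypothetical sizes chosen for illustration only (not taken from the paper).
CHANNELS, BRANCHES, TOKENS, WINDOW = 384, 3, 14 * 14, 7 * 7

param_ratio = projection_params(CHANNELS, BRANCHES) / projection_params(CHANNELS)
mac_ratio = (window_attention_macs(TOKENS, WINDOW, CHANNELS, BRANCHES)
             / window_attention_macs(TOKENS, WINDOW, CHANNELS))

print(f"projection parameters kept: {param_ratio:.3f}")
print(f"attention MACs kept:        {mac_ratio:.3f}")
```

Under these assumptions the projection weights shrink to exactly one third, because B branches of width C/B hold B * (C/B)^2 = C^2/B weights per projection. The attention products themselves do not shrink with the split, so the compute ratio sits slightly above one third, which is consistent with the abstract's "nearly".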
