CV LG
Jan 31, 2022

BOAT: Bilateral Local Attention Vision Transformer

Tan Yu, Gangming Zhao, Ping Li, Yizhou Yu

arXiv:2201.13027v27.331 citations

Originality: Incremental advance

AI Analysis

This work addresses efficiency and performance issues in Vision Transformers for computer vision tasks, offering an incremental improvement over existing local attention methods.

The paper tackles a limitation of window-based local self-attention in Vision Transformers: it fails to capture relationships between patches that are distant in the image plane but similar in content. The authors propose BOAT, a Bilateral Local Attention Vision Transformer that integrates feature-space local attention with image-space local attention. They report that the resulting BOAT-CSWin model clearly and consistently outperforms state-of-the-art CNN models and vision Transformers on several benchmark datasets.

Vision Transformers have achieved outstanding performance in many computer vision tasks. Early Vision Transformers such as ViT and DeiT adopt global self-attention, which is computationally expensive when the number of patches is large. To improve efficiency, recent Vision Transformers adopt local self-attention mechanisms, where self-attention is computed within local windows. Although window-based local self-attention significantly boosts efficiency, it fails to capture the relationships between distant but similar patches in the image plane. To overcome this limitation of image-space local attention, in this paper, we further exploit the locality of patches in the feature space. We group the patches into multiple clusters using their features, and self-attention is computed within every cluster. Such feature-space local attention effectively captures the connections between patches that lie in different local windows but are still relevant to each other. We propose a Bilateral lOcal Attention vision Transformer (BOAT), which integrates feature-space local attention with image-space local attention. We further integrate BOAT with both Swin and CSWin models, and extensive experiments on several benchmark datasets demonstrate that our BOAT-CSWin model clearly and consistently outperforms existing state-of-the-art CNN models and vision Transformers.
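
The two kinds of locality described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`local_attention`, `window_groups`, `feature_groups`) are hypothetical, queries, keys, and values are taken to be the raw features with no learned projections, windows are formed over a 1-D patch sequence, and the clustering is a simple stand-in (sort by the first feature coordinate and split into equal-size chunks), since the abstract does not describe the clustering procedure or how the two attention outputs are combined.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def local_attention(feats, groups):
    """Self-attention restricted to each group of patch indices.

    Assumption: Q = K = V = feats (no learned projections), for brevity.
    """
    dim = len(feats[0])
    out = [None] * len(feats)
    for group in groups:
        for i in group:
            scores = [
                sum(a * b for a, b in zip(feats[i], feats[j])) / math.sqrt(dim)
                for j in group
            ]
            weights = softmax(scores)
            out[i] = [
                sum(w * feats[j][d] for w, j in zip(weights, group))
                for d in range(dim)
            ]
    return out


def window_groups(num_patches, window):
    """Image-space locality: consecutive patches share a window (1-D simplification)."""
    return [
        list(range(start, min(start + window, num_patches)))
        for start in range(0, num_patches, window)
    ]


def feature_groups(feats, num_clusters):
    """Feature-space locality: equal-size groups of patches with similar features.

    Stand-in clustering (an assumption): sort by the first feature
    coordinate and split into equal-size chunks.
    """
    order = sorted(range(len(feats)), key=lambda i: feats[i][0])
    size = math.ceil(len(feats) / num_clusters)
    return [order[start:start + size] for start in range(0, len(order), size)]


# Patches 0 and 3 have similar features but sit in different windows.
feats = [[1.0, 0.0], [-1.0, 0.2], [-0.9, 0.1], [0.9, 0.1]]

img_groups = window_groups(len(feats), window=2)      # [[0, 1], [2, 3]]
feat_groups = feature_groups(feats, num_clusters=2)   # [[1, 2], [3, 0]]

image_out = local_attention(feats, img_groups)     # patch 0 attends to patches 0 and 1
feature_out = local_attention(feats, feat_groups)  # patch 0 attends to patches 3 and 0
```

In this toy input, window-based grouping never lets patch 0 attend to patch 3, while feature-based grouping places them in the same cluster, which is the gap the abstract says feature-space local attention is meant to close.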
