CV | LG | Jun 23, 2023

Swin-Free: Achieving Better Cross-Window Attention and Efficiency with Size-varying Window

arXiv:2306.13776v1 | h-index: 11
Originality: Incremental advance
AI Analysis

This work addresses efficiency issues in vision transformers for computer vision applications, representing an incremental improvement.

The paper tackles the memory copy overhead in Swin Transformer's shifting windows by proposing Swin-Free, which uses size-varying windows for cross-window connections, resulting in faster inference and better accuracy compared to Swin Transformer.

Transformer models have shown great potential in computer vision, following their success in language tasks. Swin Transformer is one such model: it outperforms convolution-based architectures in accuracy while improving efficiency over Vision Transformer (ViT) and its variants, which have quadratic complexity with respect to the input size. Swin Transformer features shifting windows that allow cross-window connections while limiting self-attention computation to non-overlapping local windows. However, window shifting introduces memory copy operations, which account for a significant portion of its runtime. To mitigate this issue, we propose Swin-Free, in which we apply size-varying windows across stages, instead of shifting windows, to achieve cross-connection among local windows. With this simple design change, Swin-Free runs faster than Swin Transformer at inference with better accuracy. Furthermore, we also propose a few Swin-Free variants that are faster than their Swin Transformer counterparts.
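
The difference between the two cross-window mechanisms can be illustrated with a toy sketch. It is not the paper's implementation: the 8x8 token grid, the window sizes 2 and 4, and the helper names `window_ids`, `roll`, and `same_window` are all assumptions chosen for illustration. It shows that two neighbouring tokens that start in different windows can be brought into the same window either by cyclically shifting the grid (which builds a rolled copy) or by simply partitioning with a larger window (which needs no copy).

```python
def window_ids(height, width, window):
    """Assign each grid position the id of its non-overlapping window."""
    assert height % window == 0 and width % window == 0
    per_row = width // window
    return [[(r // window) * per_row + c // window for c in range(width)]
            for r in range(height)]


def roll(grid, shift):
    """Cyclically shift a 2-D grid up and left by `shift`.

    Returns a new grid, standing in for the memory copy that a
    roll-style shift performs (assumption: toy stand-in, not a real kernel).
    """
    h, w = len(grid), len(grid[0])
    return [[grid[(r + shift) % h][(c + shift) % w] for c in range(w)]
            for r in range(h)]


def same_window(tokens, window, a, b):
    """True if tokens `a` and `b` land in the same window of `tokens`."""
    h, w = len(tokens), len(tokens[0])
    ids = window_ids(h, w, window)
    where = {tokens[r][c]: ids[r][c] for r in range(h) for c in range(w)}
    return where[a] == where[b]


# Hypothetical toy setup: each token is labelled by its original (row, col).
H = W = 8
tokens = [[(r, c) for c in range(W)] for r in range(H)]
a, b = (2, 1), (2, 2)  # horizontal neighbours on a window boundary

# Plain window attention with window size 2 keeps the two tokens apart.
separate_at_first = not same_window(tokens, 2, a, b)

# Shifted-window style: roll by half a window, then partition again.
shifted = roll(tokens, 1)
joined_by_shift = same_window(shifted, 2, a, b)

# Size-varying style: keep the grid as-is and use a larger window.
joined_by_resize = same_window(tokens, 4, a, b)

print(separate_at_first, joined_by_shift, joined_by_resize)
```

In this sketch both routes connect the two tokens, but only the shifted route has to materialise a second grid. The real model additionally masks attention across wrapped-around edges after a shift, which this toy example omits.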
