CV · Dec 14, 2023

Factorization Vision Transformer: Modeling Long Range Dependency with Local Window Cost

arXiv:2312.08614v1 · 21 citations · h-index: 21 · Has Code · IEEE Trans Neural Netw Learn Syst
Originality: Incremental advance
AI Analysis

This work addresses computational and robustness issues in vision transformers for computer vision applications, representing an incremental improvement over existing methods like Swin.

The paper tackles the computational inefficiency and limited global dependency modeling of local window-based vision transformers by proposing a factorization self-attention mechanism (FaSA), which achieves linear computational complexity and improves classification accuracy by 1% and robustness by 7% over Swin-T while reducing parameters by 14%.

Transformers have astounding representational power but typically incur a computational cost that is quadratic in image resolution. The prevailing Swin transformer reduces this cost through a local window strategy. However, this strategy inevitably causes two drawbacks: (1) local window-based self-attention hinders global dependency modeling; (2) recent studies point out that local windows impair robustness. To overcome these challenges, we pursue a preferable trade-off between computational cost and performance. Accordingly, we propose a novel factorization self-attention mechanism (FaSA) that combines the cost of local windows with long-range dependency modeling capability. By factorizing the conventional attention matrix into sparse sub-attention matrices, FaSA captures long-range dependencies while aggregating mixed-grained information at a computational cost equivalent to that of local window-based self-attention. Leveraging FaSA, we present the factorization vision transformer (FaViT) with a hierarchical structure. FaViT achieves high performance and robustness, with computational complexity that is linear with respect to the spatial resolution of the input image. Extensive experiments show FaViT's advanced performance in classification and downstream tasks. Furthermore, it exhibits strong robustness to corrupted and biased data, which benefits practical applications. Compared with the baseline model Swin-T, our FaViT-B2 improves classification accuracy by 1% and robustness by 7%, while reducing model parameters by 14%. Our code will soon be publicly available at https://github.com/q2479036243/FaViT.
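The cost argument in the abstract can be made concrete with a small counting sketch. It is not the paper's FaSA implementation. It only counts query-key pairs for global attention, for non-overlapping local windows, and for a hypothetical strided grouping in which each group holds as many tokens as a window but is sampled across the whole feature map. The function names and the strided sampling scheme are assumptions introduced for illustration.

```python
def global_attention_pairs(h, w):
    """Query-key pairs when every token attends to every token."""
    n = h * w
    return n * n


def window_attention_pairs(h, w, window):
    """Query-key pairs for non-overlapping window x window attention.

    Assumes h and w are divisible by window.
    """
    n_windows = (h // window) * (w // window)
    tokens_per_window = window * window
    return n_windows * tokens_per_window ** 2


def strided_groups(h, w, window):
    """Hypothetical sparse grouping (an assumption, not the paper's scheme).

    Tokens are grouped by their position modulo the stride, so each group
    has window * window tokens spread over the whole h x w grid.
    """
    stride_h, stride_w = h // window, w // window
    groups = {}
    for y in range(h):
        for x in range(w):
            groups.setdefault((y % stride_h, x % stride_w), []).append((y, x))
    return list(groups.values())


def grouped_attention_pairs(groups):
    """Query-key pairs when attention is restricted to each group."""
    return sum(len(g) ** 2 for g in groups)


if __name__ == "__main__":
    for size in (8, 16, 32):
        groups = strided_groups(size, size, window=4)
        print(
            size,
            global_attention_pairs(size, size),
            window_attention_pairs(size, size, window=4),
            grouped_attention_pairs(groups),
        )
```

Under these assumptions the strided groups cost exactly as many pairs as the local windows, and that count grows linearly with the number of tokens, while each group spans almost the full height and width of the grid. How FaSA actually builds and mixes its sub-attention matrices is described in the paper, not here.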

Code Implementations: 1 repo
