CV | LG | May 27, 2022

X-ViT: High Performance Linear Vision Transformer without Softmax

arXiv:2205.13805v1 | 3 citations | h-index: 4
Originality: Highly original
AI Analysis

The paper addresses a major bottleneck in vision transformers, the quadratic cost of self-attention, and offers a more efficient alternative with potentially broad impact on computer vision applications.

The paper tackles the quadratic computational complexity of self-attention in vision transformers by proposing X-ViT, a linear-complexity variant that removes the softmax nonlinearity from self-attention and requires changing only a few lines of code. The authors report that it outperforms most transformer-based models on image classification and dense prediction tasks in most capacity regimes.

Vision transformers have become one of the most important models for computer vision tasks. Although they outperform prior works, they require heavy computational resources on a scale that is quadratic in the number of tokens, $N$. This is a major drawback of the traditional self-attention (SA) algorithm. Here, we propose X-ViT, a ViT with a novel SA mechanism that has linear complexity. The main approach of this work is to eliminate nonlinearity from the original SA. We factorize the matrix multiplication of the SA mechanism without complicated linear approximation. By modifying only a few lines of code from the original SA, the proposed models outperform most transformer-based models on image classification and dense prediction tasks in most capacity regimes.
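
The factorization described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the token count `N = 196`, the head dimension `d = 64`, and the L2 normalization used in place of softmax are all assumptions made for illustration. The point it shows is only that, once the softmax is removed, matrix multiplication is associative, so $(QK^\top)V$ can be computed as $Q(K^\top V)$, which costs $O(Nd^2)$ instead of $O(N^2 d)$.

```python
import numpy as np


def l2_normalize(x, axis, eps=1e-6):
    # Assumed stand-in for softmax: scale vectors to (near) unit L2 norm.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def quadratic_attention_no_softmax(q, k, v):
    # Builds the full (N, N) score matrix first: O(N^2 * d) time, O(N^2) memory.
    scores = q @ k.T
    return scores @ v


def linear_attention_no_softmax(q, k, v):
    # Same product, regrouped: the (d, d) context matrix is built first,
    # so the cost is O(N * d^2) and no (N, N) matrix is ever materialized.
    context = k.T @ v
    return q @ context


# Hypothetical sizes: 196 tokens (a 14 x 14 patch grid), head dimension 64.
rng = np.random.default_rng(0)
N, d = 196, 64
q = l2_normalize(rng.standard_normal((N, d)), axis=-1)  # normalize each token
k = l2_normalize(rng.standard_normal((N, d)), axis=0)   # normalize each feature
v = rng.standard_normal((N, d))

out_quadratic = quadratic_attention_no_softmax(q, k, v)
out_linear = linear_attention_no_softmax(q, k, v)

# The two groupings agree up to floating-point rounding.
max_abs_diff = float(np.max(np.abs(out_quadratic - out_linear)))
```

With softmax in place, this regrouping is not possible, because the row-wise normalization depends on the full $N \times N$ score matrix. That is why removing the nonlinearity is what makes the linear-cost ordering available.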

