CVMay 22, 2024

Vision Transformer with Sparse Scan Prior

arXiv:2405.13335v210 citationsh-index: 28MM
Originality Incremental advance
AI Analysis

This addresses the problem of high computational overhead in vision models for researchers and practitioners, offering an incremental improvement over existing Transformer methods.

The paper tackles the computational inefficiency of Vision Transformers by proposing a Sparse Scan Self-Attention mechanism inspired by human eye scanning, resulting in SSViT achieving top-1 accuracies of 84.4%/85.7% on ImageNet with 4.4G/18.2G FLOPs.

In recent years, Transformers have achieved remarkable progress in computer vision tasks. However, their global modeling often comes with substantial computational overhead, in stark contrast to the human eye's efficient information processing. Inspired by the human eye's sparse scanning mechanism, we propose a \textbf{S}parse \textbf{S}can \textbf{S}elf-\textbf{A}ttention mechanism ($\rm{S}^3\rm{A}$). This mechanism predefines a series of Anchors of Interest for each token and employs local attention to efficiently model the spatial information around these anchors, avoiding redundant global modeling and excessive focus on local information. This approach mirrors the human eye's functionality and significantly reduces the computational load of vision models. Building on $\rm{S}^3\rm{A}$, we introduce the \textbf{S}parse \textbf{S}can \textbf{Vi}sion \textbf{T}ransformer (SSViT). Extensive experiments demonstrate the outstanding performance of SSViT across a variety of tasks. Specifically, on ImageNet classification, without additional supervision or training data, SSViT achieves top-1 accuracies of \textbf{84.4\%/85.7\%} with \textbf{4.4G/18.2G} FLOPs. SSViT also excels in downstream tasks such as object detection, instance segmentation, and semantic segmentation. Its robustness is further validated across diverse datasets.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes