CV · Apr 7, 2023

SparseFormer: Sparse Visual Recognition via Limited Latent Tokens

Ziteng Gao, Zhan Tong, Limin Wang, Mike Zheng Shou

arXiv:2304.03768v19.118 citations · h-index: 65 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses computational inefficiency in vision models for researchers and practitioners by introducing a sparse alternative, though it is incremental as it builds on existing token-based methods.

The paper challenges the dense processing paradigm in vision networks by proposing SparseFormer, which represents an image with a limited number of latent tokens (down to 49) for sparse visual recognition. On ImageNet it achieves performance on par with established models while offering a better accuracy-throughput tradeoff.

Human visual recognition is a sparse process, where only a few salient visual cues are attended to rather than traversing every detail uniformly. However, most current vision networks follow a dense paradigm, processing every single visual unit (e.g., pixel or patch) in a uniform manner. In this paper, we challenge this dense paradigm and present a new method, coined SparseFormer, to imitate human sparse visual recognition in an end-to-end manner. SparseFormer learns to represent images using a highly limited number of tokens (down to 49) in the latent space with a sparse feature sampling procedure instead of processing dense units in the original pixel space. Therefore, SparseFormer circumvents most dense operations on the image space and has much lower computational costs. Experiments on the ImageNet classification benchmark dataset show that SparseFormer achieves performance on par with canonical or well-established models while offering a better accuracy-throughput tradeoff. Moreover, the design of our network can be easily extended to video classification with promising performance at lower computational costs. We hope that our work can provide an alternative way for visual modeling and inspire further research on sparse neural architectures. The code will be publicly available at https://github.com/showlab/sparseformer
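To make the idea of sparse feature sampling concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes each latent token carries a region of interest and a small set of sampling points inside it, and that features are read by bilinear interpolation. The names `bilinear_sample`, `sparse_sample`, `rois` and `offsets`, the 56×56×32 feature map, and the choice of 16 points per token are all hypothetical and chosen only for illustration. Only the token count of 49 comes from the abstract.

```python
import numpy as np

def bilinear_sample(feat, xy):
    """Bilinearly interpolate feat (H, W, C) at normalized (x, y) points in [0, 1].

    xy has shape (..., 2); the result has shape (..., C).
    """
    H, W, _ = feat.shape
    x = np.clip(xy[..., 0], 0.0, 1.0) * (W - 1)
    y = np.clip(xy[..., 1], 0.0, 1.0) * (H - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    wx = (x - x0)[..., None]  # horizontal interpolation weight
    wy = (y - y0)[..., None]  # vertical interpolation weight
    top = feat[y0, x0] * (1 - wx) + feat[y0, x1] * wx
    bottom = feat[y1, x0] * (1 - wx) + feat[y1, x1] * wx
    return top * (1 - wy) + bottom * wy

def sparse_sample(feat, rois, offsets):
    """Sample P points per token inside that token's region of interest.

    rois:    (N, 4) as (cx, cy, w, h), normalized to [0, 1]  (hypothetical layout)
    offsets: (N, P, 2) relative offsets in [-0.5, 0.5]       (hypothetical layout)
    Returns an (N, P, C) array of sampled features.
    """
    centers = rois[:, None, :2]
    sizes = rois[:, None, 2:]
    points = centers + offsets * sizes
    return bilinear_sample(feat, points)

# Illustrative sizes: 49 tokens (from the abstract), 16 points per token (assumed).
rng = np.random.default_rng(0)
feat = rng.standard_normal((56, 56, 32))
rois = np.concatenate(
    [rng.uniform(0.25, 0.75, (49, 2)), rng.uniform(0.1, 0.5, (49, 2))], axis=1
)
offsets = rng.uniform(-0.5, 0.5, (49, 16, 2))
sampled = sparse_sample(feat, rois, offsets)

num_sampled = sampled.shape[0] * sampled.shape[1]  # 49 * 16 = 784 locations read
num_dense = feat.shape[0] * feat.shape[1]          # 56 * 56 = 3136 locations in the map
```

Under these assumed sizes the tokens read 784 locations, whereas a dense pass would visit all 3136 positions of the feature map. In a trained model the regions and offsets would be learned and refined rather than drawn at random as they are here.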
