CV · Mar 30, 2023

SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer

MIT
arXiv:2303.17605v1 · 90 citations · h-index: 24
Originality: Highly original
AI Analysis

This paper addresses the high computational cost that keeps high-resolution vision transformers out of latency-sensitive applications, and offers a practical method for speeding them up.

The paper tackles the computational inefficiency of high-resolution vision transformers by introducing SparseViT, which exploits activation sparsity to skip computation in less-important regions. It reports roughly 50% latency reduction at 60% sparsity, and speedups of 1.5x, 1.4x, and 1.3x over the dense model on monocular 3D object detection, 2D instance segmentation, and 2D semantic segmentation, respectively, with negligible to no loss of accuracy.
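To make the idea of skipping less-important regions concrete, here is a minimal sketch of window-level activation pruning. It assumes that a window's importance is the L2 magnitude of its activations and that a fixed fraction of windows is dropped; the function names `window_scores` and `prune_windows` are hypothetical and are not taken from the SparseViT code.

```python
import math

def window_scores(windows):
    """L2 magnitude of each window's activations (assumed importance metric)."""
    return [math.sqrt(sum(v * v for token in w for v in token)) for w in windows]

def prune_windows(windows, sparsity):
    """Keep the highest-scoring windows and drop the rest.

    Returns the kept window indices (in original order) and the kept windows,
    which can then be processed together as one smaller dense batch.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    num_keep = max(1, round(len(windows) * (1.0 - sparsity)))
    scores = window_scores(windows)
    ranked = sorted(range(len(windows)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:num_keep])
    return kept, [windows[i] for i in kept]

# Five toy windows, each holding 2 tokens with 2 channels.
windows = [
    [[0.1, 0.0], [0.0, 0.1]],
    [[2.0, 1.0], [1.0, 2.0]],
    [[0.0, 0.0], [0.0, 0.0]],
    [[1.0, 1.0], [1.0, 1.0]],
    [[0.2, 0.1], [0.1, 0.2]],
]
kept, dense_batch = prune_windows(windows, sparsity=0.6)
print(kept)  # [1, 3]
```

Because the surviving windows are gathered into a regular batch, the attention computation stays dense over fewer windows, which is why the sparsity can turn into a measured latency reduction rather than only a lower operation count.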

High-resolution images enable neural networks to learn richer visual representations. However, this improved performance comes at the cost of growing computational complexity, hindering their usage in latency-sensitive applications. As not all pixels are equal, skipping computations for less-important regions offers a simple and effective way to reduce the computation. This, however, is hard to translate into actual speedup for CNNs, since it breaks the regularity of the dense convolution workload. In this paper, we introduce SparseViT, which revisits activation sparsity for recent window-based vision transformers (ViTs). As window attentions are naturally batched over blocks, actual speedup with window activation pruning becomes possible: i.e., ~50% latency reduction with 60% sparsity. Different layers should be assigned different pruning ratios due to their diverse sensitivities and computational costs. We introduce sparsity-aware adaptation and apply evolutionary search to efficiently find the optimal layerwise sparsity configuration within the vast search space. SparseViT achieves speedups of 1.5x, 1.4x, and 1.3x compared to its dense counterpart in monocular 3D object detection, 2D instance segmentation, and 2D semantic segmentation, respectively, with negligible to no loss of accuracy.
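The layerwise search described in the abstract can be illustrated with a toy evolutionary loop. This is a sketch under stated assumptions, not the authors' procedure: the per-layer `costs`, the `sensitivities`, the quadratic `proxy_loss` that stands in for measured accuracy, the linear `latency` model, and the candidate `RATIOS` are all hypothetical.

```python
import random

RATIOS = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]  # assumed candidate ratios

def latency(config, costs):
    """Toy latency model: each layer's cost scales with the fraction of windows kept."""
    return sum(c * (1.0 - s) for c, s in zip(costs, config))

def proxy_loss(config, sensitivities):
    """Toy stand-in for accuracy loss; a real search would evaluate the model."""
    return sum(w * s * s for w, s in zip(sensitivities, config))

def random_config(rng, costs, budget):
    """Rejection-sample a configuration that meets the latency budget."""
    while True:
        cfg = [rng.choice(RATIOS) for _ in costs]
        if latency(cfg, costs) <= budget:
            return cfg

def mutate(cfg, rng, prob=0.3):
    """Resample each layer's ratio with probability `prob`."""
    return [rng.choice(RATIOS) if rng.random() < prob else s for s in cfg]

def crossover(a, b, rng):
    """Pick each layer's ratio from one of the two parents."""
    return [rng.choice(pair) for pair in zip(a, b)]

def evolutionary_search(costs, sensitivities, budget,
                        pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    population = [random_config(rng, costs, budget) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda c: proxy_loss(c, sensitivities))
        parents = population[: pop_size // 4]
        children = []
        while len(children) < pop_size - len(parents):
            if rng.random() < 0.5:
                child = mutate(rng.choice(parents), rng)
            else:
                child = crossover(rng.choice(parents), rng.choice(parents), rng)
            if latency(child, costs) <= budget:  # discard over-budget children
                children.append(child)
        population = parents + children
    return min(population, key=lambda c: proxy_loss(c, sensitivities))

costs = [4.0, 3.0, 2.0, 1.0]          # hypothetical per-layer dense latency
sensitivities = [1.0, 2.0, 4.0, 8.0]  # hypothetical per-layer sensitivity
budget = 6.0                          # 60% of the dense latency
best = evolutionary_search(costs, sensitivities, budget)
print(best, latency(best, costs))
```

In this toy setup the search tends to place higher pruning ratios on layers that are expensive but insensitive, which mirrors the abstract's point that layers differ in both sensitivity and computational cost.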

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
