CV, LG | Dec 1, 2024

Token Cropr: Faster ViTs for Quite a Few Tasks

Benjamin Bergner, Christoph Lippert, Aravindh Mahendran

arXiv:2412.00965v1 | 12.18 citations | h-index: 5 | Has Code | CVPR

Originality: Incremental advance

AI Analysis

This work improves ViT efficiency across several vision tasks, but the advance is incremental because it builds on existing token reduction methods.

The paper targets inference throughput for Vision Transformers (ViTs) in resource-constrained applications. It proposes a token pruner whose auxiliary prediction heads learn to select tokens by task relevance, and it reports speedups of 1.5 to 4x with small performance drops, for example a 2x speedup with a 0.1 median mIoU penalty on ADE20k semantic segmentation.

The adoption of Vision Transformers (ViTs) in resource-constrained applications necessitates improvements in inference throughput. To this end several token pruning and merging approaches have been proposed that improve efficiency by successively reducing the number of tokens. However, it remains an open problem to design a token reduction method that is fast, maintains high performance, and is applicable to various vision tasks. In this work, we present a token pruner that uses auxiliary prediction heads that learn to select tokens end-to-end based on task relevance. These auxiliary heads can be removed after training, leading to throughput close to that of a random pruner. We evaluate our method on image classification, semantic segmentation, object detection, and instance segmentation, and show speedups of 1.5 to 4x with small drops in performance. As a best case, on the ADE20k semantic segmentation benchmark, we observe a 2x speedup relative to the no-pruning baseline, with a negligible performance penalty of 0.1 median mIoU across 5 seeds.
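
The selection step the abstract describes (keep the most task-relevant tokens, drop the rest) can be illustrated with a generic top-k pruner. This is only a minimal sketch under stated assumptions, not the authors' implementation: the function names `prune_tokens` and `random_prune` are hypothetical, tokens are stand-in strings rather than embeddings, and the relevance scores are hand-written placeholders, whereas in the paper they would come from components trained with the auxiliary heads. The random pruner is included because the abstract uses it as the throughput reference point.

```python
import random


def prune_tokens(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens, preserving their original order.

    `scores` are assumed relevance values, one per token (placeholders here).
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0 < keep <= len(tokens):
        raise ValueError("keep must be between 1 and len(tokens)")
    # Rank token indices by score, highest first, and take the top `keep`.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:keep]
    # Restore spatial/sequence order among the surviving tokens.
    kept = sorted(top)
    return [tokens[i] for i in kept], kept


def random_prune(tokens, keep, seed=0):
    """Baseline: keep `keep` tokens chosen uniformly at random."""
    rng = random.Random(seed)
    kept = sorted(rng.sample(range(len(tokens)), keep))
    return [tokens[i] for i in kept], kept


# Six stand-in tokens with hypothetical relevance scores.
tokens = ["t0", "t1", "t2", "t3", "t4", "t5"]
scores = [0.1, 0.9, 0.3, 0.8, 0.05, 0.6]

kept_tokens, kept_idx = prune_tokens(tokens, scores, keep=3)
print(kept_idx)     # [1, 3, 5]
print(kept_tokens)  # ['t1', 't3', 't5']
```

Halving the token count at each pruning stage, as in this toy call, is an arbitrary choice for illustration; the keep ratios and pruning locations used in the paper are not stated in this summary.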
