CV · Aug 3, 2023

Dynamic Token-Pass Transformers for Semantic Segmentation

arXiv:2308.01944v1 · 14 citations · h-index: 60
Originality: Incremental advance
AI Analysis

This work addresses efficiency issues for researchers and practitioners using vision transformers in semantic segmentation, offering a method to reduce computational overhead while maintaining accuracy, though it is incremental as it builds on existing transformer architectures.

The paper tackles the high computational cost of vision transformers in semantic segmentation by introducing dynamic token-pass transformers (DoViT), which adaptively reduce inference cost by withdrawing easy tokens from self-attention early. The reported result is a 40-60% reduction in FLOPs with an mIoU drop of less than 0.8%, and more than 2x higher throughput and inference speed for ViT-L/B on Cityscapes.
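To give a feel for why dropping tokens reduces FLOPs, here is a rough back-of-the-envelope sketch. It uses the standard multiply-accumulate count for one transformer block (linear projections and MLP scale with the token count N, the attention matrix with N squared). The function names, the ViT-B-like sizes, and the keep schedule are illustrative assumptions, not values from the paper, and the estimate ignores the auxiliary heads and the segmentation decoder.

```python
def block_macs(n_tokens, dim, mlp_ratio=4):
    """Approximate multiply-accumulates for one transformer block."""
    proj = 4 * n_tokens * dim * dim            # Q, K, V and output projections
    attn = 2 * n_tokens * n_tokens * dim       # QK^T and attention-weighted V
    mlp = 2 * mlp_ratio * n_tokens * dim * dim # two MLP linear layers
    return proj + attn + mlp


def encoder_macs(tokens_per_layer, dim):
    """Sum the block cost over layers, each with its own active-token count."""
    return sum(block_macs(n, dim) for n in tokens_per_layer)


# Hypothetical setup: 12 layers, width 768, 1024 tokens per image.
dim, n_layers, n_tokens = 768, 12, 1024
full = encoder_macs([n_tokens] * n_layers, dim)

# Hypothetical schedule (NOT from the paper): all tokens for 4 layers,
# half of them for the next 4, a quarter for the last 4.
schedule = [1024] * 4 + [512] * 4 + [256] * 4
dynamic = encoder_macs(schedule, dim)

saving = 1 - dynamic / full
print(f"estimated encoder MAC saving: {saving:.1%}")
```

The actual saving in DoViT depends on how many tokens meet the stopping criteria at each layer, which varies with image complexity.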

Vision transformers (ViT) usually extract features by forwarding all tokens through every self-attention layer, from the first layer to the last. In this paper, we introduce dynamic token-pass vision transformers (DoViT) for semantic segmentation, which adaptively reduce the inference cost for images of different complexity. DoViT gradually stops a portion of the easy tokens from taking part in self-attention and keeps forwarding the hard tokens until they meet the stopping criteria. We employ lightweight auxiliary heads to make the token-pass decision and divide the tokens into a keeping part and a stopping part. Because the two parts are computed separately, the self-attention layers are sped up by operating on sparse tokens while remaining hardware-friendly. A token reconstruction module collects the grouped tokens and restores them to their original positions in the sequence, which is necessary to predict correct semantic masks. We conduct extensive experiments on two common semantic segmentation tasks and demonstrate that our method reduces FLOPs by roughly 40% to 60% while keeping the drop in mIoU within 0.8% for various segmentation transformers. The throughput and inference speed of ViT-L/B increase by more than 2× on Cityscapes.
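The keep/stop split and the token reconstruction step can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not spell out the stopping criteria, so the sketch assumes a per-token confidence score from an auxiliary head compared against a fixed threshold. The names `split_tokens`, `reconstruct`, and `threshold` are hypothetical.

```python
import numpy as np


def split_tokens(tokens, confidence, threshold=0.95):
    """Divide tokens into a kept (hard) group and a stopped (easy) group.

    tokens:     (N, D) token features.
    confidence: (N,) score per token, assumed to come from an auxiliary head.
    threshold:  assumed stopping criterion; tokens at or above it stop.
    """
    stop_mask = confidence >= threshold
    keep_idx = np.flatnonzero(~stop_mask)
    stop_idx = np.flatnonzero(stop_mask)
    return tokens[keep_idx], keep_idx, tokens[stop_idx], stop_idx


def reconstruct(kept, keep_idx, stopped, stop_idx):
    """Scatter both groups back to their original sequence positions."""
    n = len(keep_idx) + len(stop_idx)
    out = np.empty((n, kept.shape[1]), dtype=kept.dtype)
    out[keep_idx] = kept
    out[stop_idx] = stopped
    return out


# Toy example: 6 tokens of width 2 with made-up confidence scores.
tokens = np.arange(12, dtype=float).reshape(6, 2)
confidence = np.array([0.99, 0.50, 0.97, 0.20, 0.96, 0.80])

kept, keep_idx, stopped, stop_idx = split_tokens(tokens, confidence)

# Stand-in for the remaining self-attention layers, which would only
# process the dense (len(keep_idx), D) array of kept tokens.
kept_updated = kept + 100.0

restored = reconstruct(kept_updated, keep_idx, stopped, stop_idx)
```

Gathering the kept tokens into a dense array is what lets later layers run ordinary dense attention on a shorter sequence, and the saved indices are what allow the per-token predictions to line up with the original pixel grid afterwards.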

