CV · May 18

Token-Space Mask Prediction for Efficient Vision Transformer Segmentation

arXiv:2605.18177 · 40.9
Predicted impact: top 78% in CV · last 90 days
Originality: Incremental advance
AI Analysis

For practitioners deploying ViT segmentation on embedded systems, TokenMask offers a simpler, more efficient alternative to existing methods.

TokenMask eliminates explicit image-space feature reconstruction in query-based ViT segmentation, computing mask logits directly from query-token affinities. It achieves consistent efficiency gains (e.g., speedups on Jetson AGX Orin with TensorRT FP16) while maintaining competitive accuracy across diverse backbones and tasks.

Query-based Vision Transformer segmentation models typically reconstruct dense spatial feature maps to predict masks, inheriting design patterns from convolutional architectures. We show that this explicit image-space reconstruction is not required. We introduce TokenMask, a token-space mask head that computes mask logits directly from query-token affinities and performs interpolation in logit space rather than feature space. This reformulation preserves the original linear scoring mechanism while simplifying the computational structure. Across diverse ViT backbones, datasets and segmentation tasks, TokenMask consistently improves efficiency over prior approaches by reducing computational and memory requirements while maintaining competitive accuracy, leading to tangible speedups on NVIDIA Jetson AGX Orin using TensorRT FP16 inference. Overall, TokenMask yields a simpler and more deployment-friendly design for embedded vision systems.
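
The equivalence the abstract relies on can be checked numerically. Because a dot-product score is linear in the features and bilinear interpolation is a weighted sum, scoring tokens first and then interpolating the logits gives the same result as interpolating the features and then scoring. The sketch below is only an illustration of that property in plain Python, not the paper's implementation: the function names, the 4×4 token grid, the align-corners interpolation, and the single query are all assumptions made for the example, and the actual TokenMask head may include projections or normalization not shown here.

```python
import random


def bilinear_upsample(grid, out_h, out_w):
    """Bilinearly resize an H x W grid of C-dim vectors (align-corners style).

    Assumes the input grid and the output size are at least 2 x 2.
    """
    in_h, in_w = len(grid), len(grid[0])
    channels = len(grid[0][0])
    out = []
    for i in range(out_h):
        # Map the output row to a fractional input row.
        y = i * (in_h - 1) / (out_h - 1)
        y0 = min(int(y), in_h - 2)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1)
            x0 = min(int(x), in_w - 2)
            wx = x - x0
            # Four neighbouring cells and their interpolation weights.
            corners = (
                ((1 - wy) * (1 - wx), grid[y0][x0]),
                ((1 - wy) * wx, grid[y0][x0 + 1]),
                (wy * (1 - wx), grid[y0 + 1][x0]),
                (wy * wx, grid[y0 + 1][x0 + 1]),
            )
            row.append([sum(w * v[c] for w, v in corners) for c in range(channels)])
        out.append(row)
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def feature_space_logits(tokens, query, out_h, out_w):
    """Reconstruction pattern: upsample D-dim features, then score every pixel."""
    dense = bilinear_upsample(tokens, out_h, out_w)
    return [[dot(query, feat) for feat in row] for row in dense]


def token_space_logits(tokens, query, out_h, out_w):
    """Token-space pattern: score each token once, then upsample 1-channel logits."""
    coarse = [[[dot(query, tok)] for tok in row] for row in tokens]
    dense = bilinear_upsample(coarse, out_h, out_w)
    return [[cell[0] for cell in row] for row in dense]


# Hypothetical sizes: a 4 x 4 token grid with 8-dim features, upsampled to 16 x 16.
random.seed(0)
H, W, D = 4, 4, 8
tokens = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(W)] for _ in range(H)]
query = [random.gauss(0, 1) for _ in range(D)]

a = feature_space_logits(tokens, query, 16, 16)
b = token_space_logits(tokens, query, 16, 16)
max_diff = max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
print(max_diff)  # differs only by floating-point rounding
```

In this toy setup the feature-space path interpolates D channels and then takes a dot product at every output pixel, while the token-space path takes one dot product per token and interpolates a single channel per query. That difference in work is the kind of saving the abstract describes, under the assumptions stated above.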
