CV | Oct 11, 2024

DeBiFormer: Vision Transformer with Deformable Agent Bi-level Routing Attention

arXiv:2410.08582v1 | 14 citations | h-index: 24 | Has Code | ACCV
Originality: Incremental advance
AI Analysis

This work addresses a specific bottleneck in vision transformers for computer vision practitioners, offering incremental improvements over existing methods like DAT and BiFormer.

The authors address the selection of semantically irrelevant key-value pairs in vision transformer attention, a problem that degrades performance on semantic segmentation. They propose DeBiFormer, a backbone built on a Deformable Bi-level Routing Attention (DBRA) module, and report strong results on image classification, object detection, and semantic segmentation.

Vision Transformers with various attention modules have demonstrated superior performance on vision tasks. While using sparsity-adaptive attention, such as in DAT, has yielded strong results in image classification, the key-value pairs selected by deformable points lack semantic relevance when fine-tuning for semantic segmentation tasks. The query-aware sparsity attention in BiFormer seeks to focus each query on top-k routed regions. However, during attention calculation, the selected key-value pairs are influenced by too many irrelevant queries, reducing attention on the more important ones. To address these issues, we propose the Deformable Bi-level Routing Attention (DBRA) module, which optimizes the selection of key-value pairs using agent queries and enhances the interpretability of queries in attention maps. Based on this, we introduce the Deformable Bi-level Routing Attention Transformer (DeBiFormer), a novel general-purpose vision transformer built with the DBRA module. DeBiFormer has been validated on various computer vision tasks, including image classification, object detection, and semantic segmentation, providing strong evidence of its effectiveness. Code is available at https://github.com/maclong01/DeBiFormer
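
The "top-k routed regions" idea mentioned in the abstract can be illustrated with a small sketch. This is not the DeBiFormer or BiFormer code: it is a simplified, pure-Python toy that assumes region descriptors are plain averages of token vectors and that affinity is an unprojected dot product. The function names `mean_pool` and `topk_routing` and all input values are hypothetical, chosen only for illustration.

```python
def mean_pool(tokens):
    """Average equal-length token vectors into one region descriptor."""
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]


def topk_routing(region_queries, region_keys, k):
    """For each query region, return the indices of the k key regions
    with the highest dot-product affinity (highest first)."""
    routed = []
    for q in region_queries:
        scores = [sum(a * b for a, b in zip(q, key)) for key in region_keys]
        order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        routed.append(order[:k])
    return routed


# Toy input (made-up values): 3 regions, each holding 2 tokens of dimension 2.
regions = [
    [[1.0, 0.0], [0.8, 0.2]],
    [[0.0, 1.0], [0.2, 0.8]],
    [[0.9, 0.1], [1.0, 0.0]],
]

# Assumption: the same pooled descriptor serves as both region query and
# region key; a real model would apply learned projections first.
descriptors = [mean_pool(r) for r in regions]
routes = topk_routing(descriptors, descriptors, k=2)
print(routes)
```

Each row of `routes` lists the regions whose key-value pairs a query region would attend to; token-level attention would then be restricted to those regions. According to the abstract, DBRA changes how these key-value pairs are chosen by introducing agent queries at deformable points, a step this sketch does not model.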

Code Implementations: 1 repo
