CV · Mar 8, 2022

Dynamic Group Transformer: A General Vision Transformer Backbone with Dynamic Group Attention

arXiv:2203.03937v4 · 22 citations · h-index: 24
Originality: Highly original
AI Analysis

This paper addresses the computational bottleneck of vision transformers, giving researchers and practitioners a more efficient and effective alternative to hand-crafted window methods.

The paper tackles the inefficiency of vision transformers by proposing Dynamic Group Attention, which dynamically divides queries into groups and selects the most relevant keys/values for each group. This avoids the quadratic cost of every query attending to every key/value, and the resulting models outperform state-of-the-art methods on tasks such as image classification and object detection.

Recently, Transformers have shown promising performance in various vision tasks. To reduce the quadratic computation complexity caused by each query attending to all keys/values, various methods have constrained the range of attention to local regions, where each query attends only to keys/values within a hand-crafted window. However, these hand-crafted window partition mechanisms are data-agnostic and ignore the input content, so a query may well attend to irrelevant keys/values. To address this issue, we propose Dynamic Group Attention (DG-Attention), which dynamically divides all queries into multiple groups and selects the most relevant keys/values for each group. DG-Attention can flexibly model more relevant dependencies without the spatial constraint imposed by hand-crafted window-based attention. Built on DG-Attention, we develop a general vision transformer backbone named Dynamic Group Transformer (DGT). Extensive experiments show that our models outperform state-of-the-art methods on multiple common vision tasks, including image classification, semantic segmentation, object detection, and instance segmentation.
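To make the idea of "group the queries, then pick keys/values per group" concrete, here is a minimal single-head NumPy sketch. It is not the authors' implementation: the abstract does not say how groups are formed or how keys are scored, so this sketch assumes a hypothetical set of group prototypes, assigns each query to its most similar prototype, and keeps the `top_k` keys most similar to that prototype. The function name `dynamic_group_attention` and the parameters `prototypes` and `top_k` are illustrative only.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def dynamic_group_attention(q, k, v, prototypes, top_k):
    """Toy grouped attention (assumed design, not the paper's exact method).

    q, k, v:    (N, d) arrays of queries, keys, and values
    prototypes: (G, d) array of group prototypes (hypothetical)
    top_k:      number of keys/values kept for each group
    """
    n, d = q.shape
    out = np.zeros((n, v.shape[1]))

    # Step 1: assign every query to its most similar prototype.
    group_of_query = np.argmax(q @ prototypes.T, axis=1)      # (N,)

    # Step 2: for each group, keep the top_k keys closest to its prototype.
    key_scores = prototypes @ k.T                              # (G, N)
    selected = np.argsort(-key_scores, axis=1)[:, :top_k]      # (G, top_k)

    # Step 3: scaled dot-product attention, restricted to the selected keys.
    for g in range(prototypes.shape[0]):
        members = np.where(group_of_query == g)[0]
        if members.size == 0:
            continue  # no query landed in this group
        k_g, v_g = k[selected[g]], v[selected[g]]
        attn = softmax(q[members] @ k_g.T / np.sqrt(d))        # (|members|, top_k)
        out[members] = attn @ v_g
    return out, group_of_query, selected


rng = np.random.default_rng(0)
N, d, G, K = 16, 8, 4, 5
q, k, v = (rng.standard_normal((N, d)) for _ in range(3))
prototypes = rng.standard_normal((G, d))
out, groups, selected = dynamic_group_attention(q, k, v, prototypes, top_k=K)
print(out.shape)  # (16, 8)
```

In this sketch each query scores only `top_k` keys instead of all N, so the attention step costs roughly N · top_k · d rather than N² · d, plus about G · N · d for grouping and key selection. With a single group and `top_k = N` it reduces to ordinary full attention. Unlike a fixed window, the selected keys can come from anywhere in the input.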

