CV, LG | Dec 13, 2022

GPViT: A High Resolution Non-Hierarchical Vision Transformer with Group Propagation

Chenhongyi Yang, Jiarui Xu, Shalini De Mello, Elliot J. Crowley, Xiaolong Wang

arXiv:2212.06795v2

Originality: Highly original

AI Analysis

This work addresses a computational bottleneck for high-resolution visual tasks like detection and segmentation, offering a novel non-hierarchical transformer model that improves efficiency and performance.

The authors address the cost of exchanging global information between high-resolution features in vision transformers. Because the cost of self-attention grows quadratically with the number of tokens, it becomes expensive in memory and computation at high resolution. They introduce the Group Propagation Vision Transformer (GPViT), built around a Group Propagation Block (GP Block). The reported result is significant performance gains across visual recognition tasks; for example, GPViT-L3 outperforms Swin Transformer-B by 2.0 mIoU on ADE20K semantic segmentation with half the parameters.
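To make the scaling argument concrete, the sketch below counts attention-score entries for full self-attention over N tokens versus routing through M group tokens (one grouping step and one ungrouping step). The token and group counts are hypothetical illustration values, not figures from the paper, and the count ignores projections and the propagation step itself.

```python
def self_attention_pairs(num_tokens: int) -> int:
    """Score entries when every token attends to every other token: N * N."""
    return num_tokens * num_tokens


def grouped_pairs(num_tokens: int, num_groups: int) -> int:
    """Score entries for grouping (M x N) plus ungrouping (N x M): 2 * N * M."""
    return 2 * num_tokens * num_groups


# Hypothetical setting: a 56 x 56 feature map and 64 group tokens.
num_tokens = 56 * 56
num_groups = 64

full = self_attention_pairs(num_tokens)
grouped = grouped_pairs(num_tokens, num_groups)
print(f"full self-attention: {full:,} entries")
print(f"via group tokens:    {grouped:,} entries")
print(f"ratio:               {full / grouped:.1f}x")
```

With M fixed, the grouped count grows linearly in N, while the full self-attention count grows quadratically.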

We present the Group Propagation Vision Transformer (GPViT): a novel non-hierarchical (i.e., non-pyramidal) transformer model designed for general visual recognition with high-resolution features. High-resolution features (or tokens) are a natural fit for tasks that involve perceiving fine-grained details, such as detection and segmentation, but exchanging global information between these features is expensive in memory and computation because of the way self-attention scales. We provide a highly efficient alternative, the Group Propagation Block (GP Block), to exchange global information. In each GP Block, features are first grouped together by a fixed number of learnable group tokens; we then perform Group Propagation, where global information is exchanged between the grouped features; finally, global information in the updated grouped features is returned to the image features through a transformer decoder. We evaluate GPViT on a variety of visual recognition tasks including image classification, semantic segmentation, object detection, and instance segmentation. Our method achieves significant performance gains over previous works across all tasks, especially on tasks that require high-resolution outputs. For example, our GPViT-L3 outperforms Swin Transformer-B by 2.0 mIoU on ADE20K semantic segmentation with only half as many parameters. Project page: chenhongyiyang.com/projects/GPViT/GPViT
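The three steps of a GP Block (grouping, propagation, ungrouping) can be sketched roughly as follows in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: `cross_attention` is a single-head stand-in with no learned projections, normalization, or feed-forward layers; the propagation step is assumed here to be a simple linear mix across groups; and all names and sizes (`gp_block`, `mix_weights`, N, M, C) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys_values):
    # Simplified single-head attention: queries attend to keys_values.
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values


def gp_block(image_tokens, group_tokens, mix_weights):
    # 1. Grouping: M group tokens gather information from N image tokens.
    grouped = cross_attention(group_tokens, image_tokens)      # (M, C)
    # 2. Propagation (assumed form): mix information across the M groups.
    propagated = grouped + mix_weights @ grouped                # (M, C)
    # 3. Ungrouping: image tokens read back from the updated groups,
    #    standing in for the transformer decoder step.
    update = cross_attention(image_tokens, propagated)          # (N, C)
    return image_tokens + update                                # residual


rng = np.random.default_rng(0)
N, M, C = 196, 16, 32  # hypothetical token count, group count, channels
image_tokens = rng.standard_normal((N, C))
group_tokens = rng.standard_normal((M, C))  # learnable in a real model
mix_weights = rng.standard_normal((M, M)) * 0.1

out = gp_block(image_tokens, group_tokens, mix_weights)
print(out.shape)
```

Every attention matrix in this sketch has shape (M, N) or (N, M), so no N x N matrix is ever formed, and the output keeps the input's resolution.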
