CV LG IV Jun 5, 2020

Visual Transformers: Token-based Image Representation and Processing for Computer Vision

Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Zhicheng Yan, Masayoshi Tomizuka, Joseph Gonzalez, Kurt Keutzer, Peter Vajda

arXiv:2006.03677v4 40.1731 citations

Originality: Highly original

AI Analysis

This work targets inefficiencies of convolutional models in image classification and semantic segmentation, proposing a token-based paradigm that improves accuracy while reducing compute.

The paper tackles the limitations of convolutional neural networks in computer vision by introducing Visual Transformers (VTs), which represent images as semantic tokens and use transformers to model the relationships between those tokens. On ImageNet, VTs raise ResNet top-1 accuracy by 4.6 to 7 points while using fewer FLOPs and parameters. For semantic segmentation, VT-based feature pyramid networks achieve 0.35 points higher mIoU while reducing the FPN module's FLOPs by 6.5x.

Computer vision has achieved remarkable success by (a) representing images as uniformly-arranged pixel arrays and (b) convolving highly-localized features. However, convolutions treat all image pixels equally regardless of importance; explicitly model all concepts across all images, regardless of content; and struggle to relate spatially-distant concepts. In this work, we challenge this paradigm by (a) representing images as semantic visual tokens and (b) running transformers to densely model token relationships. Critically, our Visual Transformer operates in a semantic token space, judiciously attending to different image parts based on context. This is in sharp contrast to pixel-space transformers that require orders-of-magnitude more compute. Using an advanced training recipe, our VTs significantly outperform their convolutional counterparts, raising ResNet accuracy on ImageNet top-1 by 4.6 to 7 points while using fewer FLOPs and parameters. For semantic segmentation on LIP and COCO-stuff, VT-based feature pyramid networks (FPN) achieve 0.35 points higher mIoU while reducing the FPN module's FLOPs by 6.5x.
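The two steps named in the abstract, pooling pixels into a few semantic tokens and then running a transformer over those tokens, can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the function names, the single-head attention with a residual connection, the random weights, and the sizes (a 14x14 feature map, 8 channels, 4 tokens) are all assumptions chosen for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(v):
    """Numerically stable softmax of a 1-D list."""
    peak = max(v)
    exps = [math.exp(x - peak) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def tokenize(pixels, w_attn):
    """Pool N pixel features (N x C) into L tokens (L x C).

    Each token is a weighted average of all pixels; the weights come from
    a softmax over the pixel axis, so each token's weights sum to 1.
    """
    scores = transpose(matmul(pixels, w_attn))   # L x N
    weights = [softmax(row) for row in scores]   # softmax over pixels
    return matmul(weights, pixels), weights      # (L x C), (L x N)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention over tokens with a residual connection."""
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    scale = math.sqrt(len(q[0]))
    logits = matmul(q, transpose(k))             # L x L
    attn = [softmax([x / scale for x in row]) for row in logits]
    mixed = matmul(attn, v)                      # L x C
    return [[t + m for t, m in zip(t_row, m_row)]
            for t_row, m_row in zip(tokens, mixed)]


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights or backbone features."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
N, C, L = 196, 8, 4   # assumed sizes: 14x14 feature map, 8 channels, 4 tokens
pixels = random_matrix(N, C, rng)
tokens, weights = tokenize(pixels, random_matrix(C, L, rng))
out = self_attention(tokens, *(random_matrix(C, C, rng) for _ in range(3)))

print(len(out), len(out[0]))   # token count and channel width
print(N * N, L * L)            # attention-matrix entries: pixel space vs. token space
```

With these assumed sizes, self-attention over all 196 pixels would need a 196 x 196 attention matrix (38,416 entries), whereas attention over 4 tokens needs only 16. This covers only the attention matrix, not a full FLOP count, but it suggests why operating in token space is so much cheaper than operating in pixel space.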
