CV · LG · NE · Apr 13, 2021

Co-Scale Conv-Attentional Image Transformers

arXiv:2104.06399v2 · 453 citations
AI Analysis

This work targets image classification and downstream vision tasks. It is an incremental improvement that combines existing mechanisms in a novel way.

The paper introduces CoaT, a Transformer-based image classifier with co-scale and conv-attentional mechanisms. On ImageNet, relatively small CoaT models achieve better classification results than similar-sized CNNs and image Transformers, and the backbone is also shown to be effective on downstream tasks such as object detection and instance segmentation.

In this paper, we present Co-scale conv-attentional image Transformers (CoaT), a Transformer-based image classifier equipped with co-scale and conv-attentional mechanisms. First, the co-scale mechanism maintains the integrity of Transformers' encoder branches at individual scales, while allowing representations learned at different scales to effectively communicate with each other; we design a series of serial and parallel blocks to realize the co-scale mechanism. Second, we devise a conv-attentional mechanism by realizing a relative position embedding formulation in the factorized attention module with an efficient convolution-like implementation. CoaT empowers image Transformers with enriched multi-scale and contextual modeling capabilities. On ImageNet, relatively small CoaT models attain superior classification results compared with similar-sized convolutional neural networks and image/vision Transformers. The effectiveness of CoaT's backbone is also illustrated on object detection and instance segmentation, demonstrating its applicability to downstream computer vision tasks.
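The conv-attentional mechanism described in the abstract can be illustrated with a minimal NumPy sketch. It assumes a single attention head, no class token, and image tokens laid out row by row on a `height` x `width` grid. The function names (`conv_attention`, `depthwise_conv2d`) and the kernel layout are illustrative choices, not the authors' code. The first term is factorized attention, where softmax is taken over the token axis of K. The second term is a convolution-like relative position encoding, modeled here as a depthwise convolution of V multiplied elementwise with Q.

```python
import numpy as np


def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def depthwise_conv2d(v_img, kernels):
    """Depthwise 2D convolution with zero padding and unchanged spatial size.

    v_img:   (H, W, C) feature map
    kernels: (k, k, C), one k x k filter per channel, k odd
    """
    k = kernels.shape[0]
    p = k // 2
    H, W, _ = v_img.shape
    padded = np.pad(v_img, ((p, p), (p, p), (0, 0)))
    out = np.zeros(v_img.shape, dtype=float)
    for dy in range(k):
        for dx in range(k):
            # Shifted window of the input, weighted per channel.
            out += padded[dy:dy + H, dx:dx + W, :] * kernels[dy, dx, :]
    return out


def conv_attention(q, k, v, kernels, height, width):
    """Single-head sketch: factorized attention plus a conv position term.

    q, k, v: (N, C) with N == height * width (assumption: no class token)
    """
    n, c = q.shape
    # Factorized attention: softmax(K)^T V is a (C, C) context matrix,
    # so the N x N attention map is never formed.
    context = softmax(k, axis=0).T @ v
    factor_att = (q / np.sqrt(c)) @ context
    # Convolution-like relative position term: Q * DepthwiseConv2D(V).
    conv_v = depthwise_conv2d(v.reshape(height, width, c), kernels)
    return factor_att + q * conv_v.reshape(n, c)


rng = np.random.default_rng(0)
H, W, C = 4, 5, 8
q, k, v = (rng.normal(size=(H * W, C)) for _ in range(3))
kernels = rng.normal(size=(3, 3, C))  # hypothetical 3 x 3 window
out = conv_attention(q, k, v, kernels, H, W)
print(out.shape)
```

Because the context matrix has shape (C, C), the cost of this sketch grows linearly with the number of tokens N rather than quadratically, which is the point of factorizing the attention.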

Code Implementations: 9 repos

Data from Papers with Code (CC-BY-SA-4.0)

