CV · Apr 15, 2022

ResT V2: Simpler, Faster and Stronger

arXiv:2204.07366v3 · 35 citations · h-index: 14 · Has Code
AI Analysis

This work addresses the need for more efficient and effective vision backbones in computer vision tasks, but it appears incremental as it builds directly on ResTv1 with specific modifications.

The paper tackles the problem of improving multi-scale vision Transformers for visual recognition by proposing ResTv2, which simplifies the EMSA structure and adds an upsample operation to enhance performance. Experimental results show that ResTv2 outperforms recent state-of-the-art backbones by a large margin on ImageNet classification, COCO detection, and ADE20K semantic segmentation.

This paper proposes ResTv2, a simpler, faster, and stronger multi-scale vision Transformer for visual recognition. ResTv2 simplifies the EMSA structure in ResTv1 (i.e., eliminating the multi-head interaction part) and employs an upsample operation to reconstruct the medium- and high-frequency information lost in the downsampling operation. In addition, we explore different techniques to better apply ResTv2 backbones to downstream tasks. We found that although combining EMSAv2 and window attention can greatly reduce the theoretical matrix multiply FLOPs, it may significantly decrease the computation density, thus causing lower actual speed. We comprehensively validate ResTv2 on ImageNet classification, COCO detection, and ADE20K semantic segmentation. Experimental results show that the proposed ResTv2 can outperform recent state-of-the-art backbones by a large margin, demonstrating the potential of ResTv2 as a solid backbone. The code and models will be made publicly available at https://github.com/wofmanaf/ResT
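The "theoretical matrix multiply FLOPs" mentioned in the abstract can be illustrated with a back-of-the-envelope count. The sketch below is not taken from the paper or its repository. It uses the standard estimate that attention costs about 2 · N · M · d multiply-adds for N queries, M keys/values per query, and channel width d. The function names, the 56×56 feature map, the width of 96, the stride of 8, and the window of 7 are all hypothetical values chosen for illustration. How the paper actually combines EMSAv2 with window attention is not described here, so the last function (downsampling keys inside each window) is only an assumed reading.

```python
def attention_matmul_macs(n_queries: int, kv_per_query: int, dim: int) -> int:
    """Multiply-adds for Q @ K^T plus attn @ V (two matmuls of n * m * d each)."""
    return 2 * n_queries * kv_per_query * dim


def global_macs(n_tokens: int, dim: int) -> int:
    """Every token attends to every token."""
    return attention_matmul_macs(n_tokens, n_tokens, dim)


def downsampled_macs(n_tokens: int, dim: int, stride: int) -> int:
    """Keys/values are spatially downsampled by `stride` per side (assumed setup)."""
    return attention_matmul_macs(n_tokens, n_tokens // (stride * stride), dim)


def windowed_macs(n_tokens: int, dim: int, window: int) -> int:
    """Each token attends only to the window * window tokens of its own window."""
    return attention_matmul_macs(n_tokens, window * window, dim)


def windowed_downsampled_macs(n_tokens: int, dim: int, window: int, stride: int) -> int:
    """Assumed combination: keys/values downsampled inside each local window."""
    kv = max(1, (window * window) // (stride * stride))
    return attention_matmul_macs(n_tokens, kv, dim)


# Hypothetical first-stage feature map: 56 x 56 tokens, 96 channels.
tokens, width = 56 * 56, 96
print("global              :", global_macs(tokens, width))
print("downsampled (s=8)   :", downsampled_macs(tokens, width, 8))
print("windowed (w=7)      :", windowed_macs(tokens, width, 7))
print("windowed + s=2, w=14:", windowed_downsampled_macs(tokens, width, 14, 2))
```

These counts cover only the theoretical side of the abstract's observation. They say nothing about computation density or measured throughput, which depend on how the many small per-window matmuls map onto the hardware.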
