CV, Aug 31, 2022

MAFormer: A Transformer Network with Multi-scale Attention Fusion for Visual Recognition

Yunhao Wang, Huixin Sun, Xiaodi Wang, Bin Zhang, Chao Li, Ying Xin, Baochang Zhang, Errui Ding, Shumin Han

arXiv:2209.01620v12.626 citations, h-index: 60

Originality: Incremental advance

AI Analysis

This work targets a limitation of vision transformers in visual recognition, namely that they model global dependencies only at a coarse level, and offers an incremental improvement over existing transformer variants.

The paper addresses the difficulty vision transformers have in learning both global relationships and fine-grained, token-level representations. It introduces MAFormer, a transformer network with multi-scale attention fusion, and reports state-of-the-art results: MAFormer-L reaches 85.9% Top-1 accuracy on ImageNet, and on MSCOCO MAFormer improves over CSWin by 1.7% mAP on object detection and 1.4% on instance segmentation.

Vision Transformer and its variants have demonstrated great potential in various computer vision tasks. However, conventional vision transformers often focus on global dependency at a coarse level and therefore struggle to learn global relationships and fine-grained representations at the token level. In this paper, we introduce Multi-scale Attention Fusion into the transformer (MAFormer), which explores local aggregation and global feature extraction in a dual-stream framework for visual recognition. We develop a simple but effective module to explore the full potential of transformers for visual representation by learning fine-grained and coarse-grained features at the token level and dynamically fusing them. Our Multi-scale Attention Fusion (MAF) block consists of: i) a local window attention branch that learns short-range interactions within windows, aggregating fine-grained local features; ii) a global feature extraction branch that uses a novel Global Learning with Down-sampling (GLD) operation to efficiently capture long-range context information within the whole image; iii) a fusion module that self-explores the integration of both features via attention. MAFormer achieves state-of-the-art performance on common vision tasks. In particular, MAFormer-L achieves 85.9% Top-1 accuracy on ImageNet, surpassing CSWin-B and LV-ViT-L by 1.7% and 0.6%, respectively. On MSCOCO, MAFormer outperforms the prior art CSWin by 1.7% mAP on object detection and 1.4% on instance segmentation with a similar number of parameters, demonstrating its potential as a general backbone network.
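
The three parts of the MAF block described in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation. It assumes a 1-D token sequence instead of a 2-D feature map, identity query/key/value projections (no learned weights), average pooling as a stand-in for the GLD down-sampling, and cross-attention from local tokens to global tokens plus a residual add as the fusion step. All function names and the `window` and `stride` parameters are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention with identity projections (assumption)."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return out


def local_window_attention(tokens, window):
    """Branch i: self-attention restricted to non-overlapping windows."""
    out = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        out.extend(attention(chunk, chunk, chunk))
    return out


def global_downsampled_attention(tokens, stride):
    """Branch ii stand-in: average-pool tokens, then self-attend over the pooled set.

    Average pooling is an assumption; the abstract does not specify how GLD
    performs its down-sampling.
    """
    dim = len(tokens[0])
    pooled = []
    for start in range(0, len(tokens), stride):
        chunk = tokens[start:start + stride]
        pooled.append([sum(t[d] for t in chunk) / len(chunk) for d in range(dim)])
    return attention(pooled, pooled, pooled)


def maf_block(tokens, window=4, stride=4):
    """Fuse local and global features via attention (assumed fusion form)."""
    local = local_window_attention(tokens, window)
    global_ctx = global_downsampled_attention(tokens, stride)
    # Part iii: each local token queries the down-sampled global context.
    fused = attention(local, global_ctx, global_ctx)
    # Residual add keeps the fine-grained local features alongside the context.
    return [[l + f for l, f in zip(lt, ft)] for lt, ft in zip(local, fused)]


if __name__ == "__main__":
    demo_tokens = [[float(i), float(i % 3)] for i in range(8)]
    demo_out = maf_block(demo_tokens)
    print(len(demo_out), len(demo_out[0]))
```

The sketch preserves the token count: the local branch keeps one output per input token, while the global branch works on a shorter pooled sequence, which is what makes attending to whole-image context cheaper than full self-attention over all tokens.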
