CV LG Aug 24, 2022

gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window

arXiv:2208.11718v26.512 citations h-index: 10

Originality: Incremental advance

AI Analysis

This work addresses the need for efficient, high-performing vision models, but the advance is incremental because it builds on existing architectures.

The authors tackled the challenge of improving vision models by combining Swin Transformer and gMLP into gSwin, achieving better accuracy on image classification, object detection, and semantic segmentation with a smaller model size than Swin Transformer.

Following its success in the language domain, the self-attention mechanism (transformer) has been adopted in the vision domain and has recently achieved great success. As another stream, the multi-layer perceptron (MLP) has also been explored in the vision domain. These architectures, as alternatives to traditional CNNs, have been attracting attention recently, and many methods have been proposed. As a method that combines parameter efficiency and performance with locality and hierarchy in image recognition, we propose gSwin, which merges the two streams: Swin Transformer and (multi-head) gMLP. We show that gSwin achieves better accuracy than Swin Transformer on three vision tasks (image classification, object detection, and semantic segmentation) with a smaller model size.
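To make the idea of merging the two streams more concrete, the following is a rough pure-Python sketch of a gMLP-style spatial gate applied inside Swin-style shifted windows. It is an illustration under stated assumptions, not the authors' implementation: the function names (`cyclic_shift`, `window_partition`, `spatial_gating`), the toy 4 x 4 grid, and the zero-weight, unit-bias gate parameters are all hypothetical. The normalisation that gMLP applies before gating, the multi-head split, and any special handling of windows that wrap around the image border are omitted.

```python
def cyclic_shift(grid, shift):
    """Roll an H x W grid of tokens up and left by `shift` positions."""
    h, w = len(grid), len(grid[0])
    return [[grid[(r + shift) % h][(c + shift) % w] for c in range(w)]
            for r in range(h)]


def window_partition(grid, m):
    """Split an H x W grid into non-overlapping m x m windows (row-major)."""
    h, w = len(grid), len(grid[0])
    assert h % m == 0 and w % m == 0, "grid must be divisible by window size"
    windows = []
    for r0 in range(0, h, m):
        for c0 in range(0, w, m):
            windows.append([grid[r][c]
                            for r in range(r0, r0 + m)
                            for c in range(c0, c0 + m)])
    return windows


def spatial_gating(tokens, weight, bias):
    """gMLP-style gate over the n tokens of one window.

    Each token has 2*c channels: the first c form u, the last c form v.
    v is mixed across token positions (v' = W v + b) and then gates u.
    The normalisation gMLP applies to v is omitted here (assumption).
    """
    n = len(tokens)
    c = len(tokens[0]) // 2
    u = [t[:c] for t in tokens]
    v = [t[c:] for t in tokens]
    out = []
    for i in range(n):
        mixed = [sum(weight[i][j] * v[j][k] for j in range(n)) + bias[i]
                 for k in range(c)]
        out.append([u[i][k] * mixed[k] for k in range(c)])
    return out


# Toy input: 4 x 4 grid, 2 channels per token (hypothetical values).
grid = [[[float(r * 4 + c), 1.0] for c in range(4)] for r in range(4)]
n = 4  # tokens per 2 x 2 window
zero_w = [[0.0] * n for _ in range(n)]  # zero mixing weights
ones_b = [1.0] * n                      # unit bias, so the gate passes u through

windows = window_partition(cyclic_shift(grid, 1), 2)
gated = [spatial_gating(wnd, zero_w, ones_b) for wnd in windows]
```

With zero mixing weights and a unit bias the gate multiplies u by one, so each window's output equals its u half. This makes the effect of the shift and the partition easy to check by hand. In a trained model the weights would be learned, and alternating layers would use shifted and unshifted windows so that information can cross window boundaries.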

