Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition
This work addresses the need for more efficient visual recognition models, though it is incremental in that it builds on existing ConvNet and Transformer designs.
The paper tackles the problem of efficiently encoding spatial features in visual recognition. It proposes Conv2Former, a hierarchical ConvNet that replaces self-attention with a simpler convolutional modulation operation, and reports that it outperforms models such as Swin Transformer and ConvNeXt on the ImageNet, COCO, and ADE20k benchmarks.
This paper does not attempt to design a state-of-the-art method for visual recognition but investigates a more efficient way to make use of convolutions to encode spatial features. By comparing the design principles of recent convolutional neural networks (ConvNets) and Vision Transformers, we propose to simplify self-attention by leveraging a convolutional modulation operation. We show that such a simple approach can better take advantage of the large kernels (≥7×7) nested in convolutional layers. We build a family of hierarchical ConvNets using the proposed convolutional modulation, termed Conv2Former. Our network is simple and easy to follow. Experiments show that our Conv2Former outperforms existing popular ConvNets and vision Transformers, such as Swin Transformer and ConvNeXt, on ImageNet classification, COCO object detection, and ADE20k semantic segmentation.
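The abstract does not spell out the convolutional modulation operation, so the following is only a minimal NumPy sketch under stated assumptions. It assumes the modulation multiplies, element by element, the output of a large-kernel depthwise convolution with a linearly projected "value" branch, and then applies an output projection. The function names (`depthwise_conv2d`, `conv_modulation`), the weight names (`w1`, `w2`, `w3`, `dw_kernel`), and the 11×11 kernel in the usage lines are illustrative choices, not details taken from the text above.

```python
import numpy as np


def depthwise_conv2d(x, kernel):
    """Depthwise cross-correlation with zero 'same' padding.

    x: array of shape (C, H, W); kernel: array of shape (C, k, k), k odd.
    Each channel is filtered by its own k x k kernel.
    """
    _, h, w = x.shape
    k = kernel.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(x)
    # Accumulate one shifted, per-channel-scaled copy of the input per kernel tap.
    for i in range(k):
        for j in range(k):
            out += kernel[:, i, j][:, None, None] * xp[:, i:i + h, j:j + w]
    return out


def conv_modulation(x, w1, w2, w3, dw_kernel):
    """Assumed form of convolutional modulation (hypothetical sketch).

    w1, w2, w3: (C, C) matrices acting as 1x1 convolutions over channels.
    dw_kernel:  (C, k, k) depthwise kernel producing the modulation weights.
    """
    a = depthwise_conv2d(np.einsum("oc,chw->ohw", w1, x), dw_kernel)  # spatial weights
    v = np.einsum("oc,chw->ohw", w2, x)                               # value branch
    return np.einsum("oc,chw->ohw", w3, a * v)                        # modulate, then project


# Usage: 8 channels, a 14x14 feature map, and an 11x11 depthwise kernel.
rng = np.random.default_rng(0)
c, h, w, k = 8, 14, 14, 11
x = rng.standard_normal((c, h, w))
w1, w2, w3 = (rng.standard_normal((c, c)) / np.sqrt(c) for _ in range(3))
dw_kernel = rng.standard_normal((c, k, k)) / k
y = conv_modulation(x, w1, w2, w3, dw_kernel)  # same shape as x
```

Unlike self-attention, which builds an (H·W)×(H·W) similarity matrix, this sketch computes its weights with a fixed-size convolution, so its cost grows linearly with the number of pixels.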