Vision Transformer with Convolutions Architecture Search
This work addresses the need for more robust and versatile vision models, particularly in low-light conditions, though its contribution is incremental because it combines existing Transformer and CNN concepts.
The paper tackles the challenge of enhancing Vision Transformers for complex visual tasks by integrating convolutional features for noise reduction and invariance, proposing VTCAS to search for hybrid architectures. The resulting network achieves 82.0% Top-1 accuracy on ImageNet-1K and 50.4% mAP on COCO2017.
Transformers show strong advantages in computer vision tasks. For image classification, they split an image into a sequence of patches and process that sequence with a multi-head attention mechanism. For complex tasks, however, a vision Transformer needs more than dynamic attention and global context: it also needs features that provide noise reduction and invariance to the shifting and scaling of objects. We therefore study the structural characteristics of Transformers and convolutions and propose an architecture search method, Vision Transformer with Convolutions Architecture Search (VTCAS). The high-performance backbone network searched by VTCAS introduces the desirable features of convolutional neural networks into the Transformer architecture while maintaining the benefits of the multi-head attention mechanism. The searched block-based backbone network can extract feature maps at different scales. These features are compatible with a wide range of visual tasks, such as image classification (32 M parameters, 82.0% Top-1 accuracy on ImageNet-1K) and object detection (50.4% mAP on COCO2017). The proposed topology, based on the multi-head attention mechanism and CNNs, adaptively associates relational features of pixels with multi-scale features of objects. It enhances the robustness of the neural network for object recognition, especially in low-illumination indoor scenes.
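To make the idea of combining convolution with attention concrete, the following is a minimal pure-Python sketch, not the architecture found by VTCAS. It assumes a grayscale image stored as a list of lists, uses a fixed 3x3 mean filter as a stand-in for a learned convolution, and uses single-head attention with identity query/key/value projections in place of learned multi-head attention. All function names (`split_into_patches`, `box_filter_3x3`, `self_attention`, `hybrid_block`) are hypothetical and chosen only for illustration.

```python
import math


def split_into_patches(image, patch):
    """Split an H x W image (list of lists) into flattened patch tokens."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch"
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tokens.append([image[top + i][left + j]
                           for i in range(patch) for j in range(patch)])
    return tokens


def box_filter_3x3(image):
    """3x3 mean convolution with zero padding (assumed stand-in for a learned conv)."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        total += image[yy][xx]
            out[y][x] = total / 9.0
    return out


def self_attention(tokens):
    """Single-head scaled dot-product attention with identity Q/K/V (an assumption)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        weights = [wgt / z for wgt in weights]
        out.append([sum(wgt * v[i] for wgt, v in zip(weights, tokens))
                    for i in range(d)])
    return out


def hybrid_block(image, patch):
    """Local smoothing (conv), then global mixing (attention), with a residual sum."""
    smoothed = box_filter_3x3(image)
    tokens = split_into_patches(smoothed, patch)
    attended = self_attention(tokens)
    return [[t + a for t, a in zip(tok, att)] for tok, att in zip(tokens, attended)]


if __name__ == "__main__":
    image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
    features = hybrid_block(image, patch=2)
    print(len(features), "tokens of length", len(features[0]))
```

The convolution step acts on local neighborhoods, which is where noise suppression and shift tolerance come from, while the attention step relates every patch to every other patch. A searched backbone would choose how such operations are arranged and at which scales; this sketch fixes one arbitrary ordering.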