EAANet: Efficient Attention Augmented Convolutional Networks
This work addresses efficiency bottlenecks in computer vision models for researchers and practitioners, but it is incremental as it builds on existing hybrid architectures.
The paper tackles the quadratic computational and memory complexity of self-attention in hybrid convolutional networks by proposing EAANet, which incorporates efficient self-attention mechanisms. The best model improves on AA-Net and ResNet18, and the approach scales better to high-resolution images.
Humans can effectively find salient regions in complex scenes. Self-attention mechanisms were introduced into computer vision (CV) to emulate this ability. The Attention Augmented Convolutional Network (AA-Net) combines convolution with self-attention and improves on the accuracy of a typical ResNet. However, the complexity of self-attention is O(n^2) in both computation and memory usage with respect to the number of input tokens n. In this project, we propose EAANet: Efficient Attention Augmented Convolutional Networks, which incorporates efficient self-attention mechanisms into a hybrid convolution and self-attention architecture to reduce the model's memory footprint. Our best model shows a performance improvement over AA-Net and ResNet18. We also explore different methods of augmenting convolutional networks with self-attention mechanisms and show that these methods are harder to train than ResNet. Finally, we show that augmenting ResNet with efficient self-attention mechanisms scales better with input size than augmenting it with standard self-attention. EAANet is therefore better suited to high-resolution images.
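To make the scaling argument concrete, the following minimal sketch counts the entries that must be stored per attention head as the feature-map size grows. It rests on assumptions not stated in the abstract: every spatial position of a height x width feature map is treated as one token, and the efficient variant is assumed to keep a d_k x d_v context matrix whose size does not depend on n, as some linear-complexity attention mechanisms do. The function names and the choice d_k = d_v = 64 are illustrative only and are not taken from EAANet.

```python
def attention_map_entries(height, width):
    """Entries in the n x n attention map of standard self-attention.

    Assumption: each spatial position is one token, so n = height * width.
    """
    n = height * width
    return n * n


def linear_context_entries(d_k, d_v):
    """Entries in a d_k x d_v context matrix.

    Assumption: the efficient variant aggregates keys and values into a
    matrix of this size, independent of the number of tokens n.
    """
    return d_k * d_v


if __name__ == "__main__":
    # Hypothetical per-head dimensions, chosen only for illustration.
    d_k, d_v = 64, 64
    for side in (16, 32, 64):
        quadratic = attention_map_entries(side, side)
        linear = linear_context_entries(d_k, d_v)
        print(f"{side}x{side} map: standard={quadratic}, efficient={linear}")
```

Under these assumptions, doubling the side length of the feature map multiplies the standard attention map by 16, while the context matrix stays the same size. This is the kind of gap that makes quadratic attention costly at high resolution.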