Transformed CNNs: recasting pre-trained convolutional layers with self-attention
Aimed at computer vision researchers, this work targets the training cost and accuracy of hybrid convolution/self-attention vision models, offering an incremental improvement by reusing pre-trained CNNs.
The paper tackles the computational bottleneck of self-attention in hybrid vision models by initializing self-attention layers as the convolutional layers of a pre-trained CNN, which gives a smooth transition from the CNN to its Transformed CNN (T-CNN). With only 50 epochs of fine-tuning, T-CNNs improve on the original CNN in both accuracy (+2.2% top-1 on ImageNet-1k for a ResNet50-RS) and robustness (+11% top-1 on ImageNet-C).
Vision Transformers (ViTs) have recently emerged as a powerful alternative to convolutional neural networks (CNNs). Although hybrid models attempt to bridge the gap between these two architectures, the self-attention layers they rely on induce a strong computational bottleneck, especially at large spatial resolutions. In this work, we explore the idea of reducing the time spent training these layers by initializing them as convolutional layers. This enables us to transition smoothly from any pre-trained CNN to its functionally identical hybrid model, called a Transformed CNN (T-CNN). With only 50 epochs of fine-tuning, the resulting T-CNNs demonstrate significant performance gains over the CNN (+2.2% top-1 on ImageNet-1k for a ResNet50-RS) as well as substantially improved robustness (+11% top-1 on ImageNet-C). We analyze the representations learnt by the T-CNN, providing deeper insights into the fruitful interplay between convolutions and self-attention. Finally, we experiment with initializing the T-CNN from a partially trained CNN, and find that it reaches better performance than the corresponding hybrid model trained from scratch, while reducing training time.
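The abstract does not spell out how a self-attention layer can start out "functionally identical" to a convolution, so here is a minimal NumPy sketch of one known construction, which may differ from the one the paper uses. It assumes a 3x3 convolution with zero padding and a multi-head attention layer with nine heads, where each head attends purely by position to a single fixed offset and applies the matching slice of the kernel as its value projection. The function names `conv2d_same` and `attention_as_conv` are hypothetical and chosen only for illustration.

```python
import numpy as np

def conv2d_same(x, w):
    """Reference 3x3 convolution (cross-correlation) with zero padding.

    x: (H, W, C_in), w: (3, 3, C_in, C_out) -> (H, W, C_out)
    """
    H, W, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((H, W, w.shape[-1]))
    for dy in range(3):
        for dx in range(3):
            # Shifted view of the input times the kernel slice for this offset.
            out += xp[dy:dy + H, dx:dx + W] @ w[dy, dx]
    return out

def attention_as_conv(x, w):
    """Nine-head attention whose heads each hard-attend to one kernel offset.

    Assumption: attention is one-hot (the limit of a sharply peaked positional
    softmax). Rows whose target falls outside the image are left at zero to
    mimic zero padding, which a true softmax would not do.
    """
    H, W, C_in = x.shape
    tokens = x.reshape(H * W, C_in)               # one token per pixel
    ys, xs = np.divmod(np.arange(H * W), W)       # row and column of each token
    out = np.zeros((H * W, w.shape[-1]))
    for dy in range(3):
        for dx in range(3):
            ty, tx = ys + dy - 1, xs + dx - 1     # attended position for this head
            valid = (ty >= 0) & (ty < H) & (tx >= 0) & (tx < W)
            attn = np.zeros((H * W, H * W))       # (queries, keys) attention map
            queries = np.arange(H * W)[valid]
            attn[queries, ty[valid] * W + tx[valid]] = 1.0
            # Head output: attention-weighted tokens, then the value projection
            # taken from the pre-trained kernel slice.
            out += attn @ tokens @ w[dy, dx]
    return out.reshape(H, W, -1)

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 6, 4))
w = rng.standard_normal((3, 3, 4, 8))
same = np.allclose(conv2d_same(x, w), attention_as_conv(x, w))
```

In this sketch the two functions agree at initialization, so swapping one for the other leaves the network's outputs unchanged. Fine-tuning could then let the attention maps drift away from the fixed one-hot pattern. The dense `(H*W, H*W)` attention matrix also shows why such layers become costly at large spatial resolutions.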