Learning Spatially-Adaptive Squeeze-Excitation Networks for Image Synthesis and Image Recognition
This work addresses efficient deep learning for computer vision. It offers an incremental improvement: the existing Squeeze-Excitation (SE) module is adapted to be more data-specific.
The paper tackles the challenge of building lightweight yet expressive deep networks for image synthesis and recognition by proposing spatially-adaptive squeeze-excitation (SASE) modules. In image synthesis, these outperform prior methods on low-shot and one-shot learning tasks. On ImageNet-1000, they achieve better accuracy than vanilla ResNets and slightly better accuracy than MHSA counterparts such as the Swin-Transformer, with smaller models.
Learning light-weight yet expressive deep networks for both image synthesis and image recognition remains a challenging problem. Inspired by the recent observation that data specificity is what makes multi-head self-attention (MHSA) in the Transformer model so powerful, this paper proposes to extend the widely adopted light-weight Squeeze-Excitation (SE) module to be spatially adaptive, reinforcing its data specificity. The result is a convolutional alternative to MHSA that retains the efficiency of SE and the inductive bias of convolution. The paper presents two designs of spatially-adaptive squeeze-excitation (SASE) modules, one for image synthesis and one for image recognition. For image synthesis, the proposed SASE is tested on both low-shot and one-shot learning tasks, where it outperforms prior art. For image recognition, the proposed SASE is used as a drop-in replacement for convolution layers in ResNets. On the ImageNet-1000 dataset it achieves much better accuracy than the vanilla ResNets, and slightly better accuracy than MHSA counterparts such as the Swin-Transformer and Pyramid-Transformer, with significantly smaller models.
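To make the distinction between channel-wise and spatially-adaptive gating concrete, the following is a minimal NumPy sketch. The first function follows the standard SE recipe (global average pooling, a two-layer bottleneck, sigmoid gates shared by all positions). The second, `spatial_se_block`, is only a hypothetical illustration of what "spatially adaptive" could mean: it applies the same bottleneck at every position so that the gates vary over the feature map. It is not the SASE design of the paper, which the abstract does not specify, and the function names, weight shapes, and reduction ratio are all assumptions made for this example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def se_block(x, w1, w2):
    """Standard SE gating.

    x:  feature map of shape (C, H, W)
    w1: bottleneck weights of shape (C // r, C)
    w2: expansion weights of shape (C, C // r)
    """
    z = x.mean(axis=(1, 2))          # squeeze: one descriptor per channel, (C,)
    h = np.maximum(w1 @ z, 0.0)      # excitation, ReLU bottleneck, (C // r,)
    s = sigmoid(w2 @ h)              # channel gates in (0, 1), shape (C,)
    return x * s[:, None, None]      # same gate at every spatial position


def spatial_se_block(x, w1, w2):
    """Hypothetical spatially-adaptive variant (an assumption, not the paper's SASE).

    The bottleneck is applied at each position, so gates have shape (C, H, W).
    """
    h = np.maximum(np.einsum("rc,chw->rhw", w1, x), 0.0)
    s = sigmoid(np.einsum("cr,rhw->chw", w2, h))
    return x * s                     # gates now differ across positions
```

In this sketch both variants use the same two weight matrices. They coincide when the input is constant over space, and differ only in whether the gate depends on the spatial position.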