ShiftwiseConv: Small Convolutional Kernel with Large Kernel Effect
This work addresses a bottleneck in CNN design for vision tasks, namely the diminishing returns from ever-larger kernels, and offers a more efficient small-kernel alternative that could inspire follow-up research.
The paper tackles the diminishing returns from increasing kernel size in CNNs. It identifies two key hidden factors behind large kernels, feature extraction granularity and multi-path fusion, and proposes Shiftwise convolution to achieve comparable effects with $3 \times 3$ kernels. The resulting models surpass state-of-the-art transformers and CNNs on tasks such as classification, segmentation, and detection.
Large kernels make standard convolutional neural networks (CNNs) great again over transformer architectures in various vision tasks. Nonetheless, recent studies meticulously designed around increasing kernel size have shown diminishing returns or stagnation in performance, and the hidden factors of large kernel convolution that affect model performance remain unexplored. In this paper, we reveal that the key hidden factors of large kernels can be summarized as two separate components: extracting features at a certain granularity and fusing features by multiple pathways. To this end, we leverage the multi-path long-distance sparse dependency relationship to enhance feature utilization via the proposed Shiftwise (SW) convolution operator with a pure CNN architecture. In a wide range of vision tasks such as classification, segmentation, and detection, SW surpasses state-of-the-art transformers and CNN architectures, including SLaK and UniRepLKNet. More importantly, our experiments demonstrate that $3 \times 3$ convolutions can replace large convolutions in existing large kernel CNNs to achieve comparable effects, which may inspire follow-up works. Code and all the models are available at https://github.com/lidc54/shift-wiseConv.
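To give some intuition for how shifted $3 \times 3$ convolutions can cover the footprint of a large kernel, the following is a minimal sketch and not the authors' SW operator (see the linked repository for that). It assumes a dense $9 \times 9$ kernel cut into nine $3 \times 3$ tiles, applies each tile as its own path with a spatial offset, and sums the paths. The function names `correlate2d` and `shifted_small_kernel_sum` are hypothetical, and pure Python lists are used only to keep the example dependency-free.

```python
import random


def correlate2d(image, kernel, offset=(0, 0)):
    """Same-size cross-correlation with zero padding.

    The kernel window is centred on each output pixel and then displaced by
    `offset` = (dy, dx), which has the same effect as shifting the feature map.
    """
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    dy, dx = offset
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    yy = y + i - kh // 2 + dy
                    xx = x + j - kw // 2 + dx
                    if 0 <= yy < h and 0 <= xx < w:  # zero padding outside
                        acc += image[yy][xx] * kernel[i][j]
            out[y][x] = acc
    return out


def shifted_small_kernel_sum(image, big_kernel, small=3):
    """Rebuild a large-kernel response from shifted small-kernel paths.

    Assumption for this sketch: the large kernel is square, `small` is odd,
    and the kernel splits into an odd number of tiles per side (e.g. 9 = 3 * 3).
    """
    k = len(big_kernel)
    assert small % 2 == 1 and k % small == 0 and (k // small) % 2 == 1
    tiles = k // small
    h, w = len(image), len(image[0])
    total = [[0.0] * w for _ in range(h)]
    for a in range(tiles):
        for b in range(tiles):
            # Cut one small x small tile out of the large kernel.
            tile = [row[b * small:(b + 1) * small]
                    for row in big_kernel[a * small:(a + 1) * small]]
            # Displace the small kernel to where the tile sat in the large one.
            offset = (small * (a - tiles // 2), small * (b - tiles // 2))
            path = correlate2d(image, tile, offset)
            # Fuse the paths by summation.
            for y in range(h):
                for x in range(w):
                    total[y][x] += path[y][x]
    return total


if __name__ == "__main__":
    random.seed(0)
    image = [[random.random() for _ in range(12)] for _ in range(12)]
    big = [[random.random() for _ in range(9)] for _ in range(9)]
    reference = correlate2d(image, big)              # one 9x9 kernel
    rebuilt = shifted_small_kernel_sum(image, big)   # nine shifted 3x3 paths
    max_err = max(abs(r - s)
                  for ref_row, reb_row in zip(reference, rebuilt)
                  for r, s in zip(ref_row, reb_row))
    print("max abs difference:", max_err)
```

In this dense setting the two outputs agree up to floating-point rounding, because every weight of the large kernel appears in exactly one shifted tile. The abstract describes a sparse, multi-path dependency rather than this exhaustive tiling, so the sketch should be read only as an illustration of the shift-and-fuse idea.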