CV · Jul 8, 2024

Wavelet Convolutions for Large Receptive Fields

arXiv:2407.05848v2 · 453 citations · h-index: 21 · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses a key bottleneck of CNNs in computer vision by enabling very large receptive fields, approaching the global receptive field of Vision Transformers, though it is an incremental improvement over existing methods.

The paper tackles the problem of achieving large receptive fields in CNNs without over-parameterization. It introduces WTConv, a wavelet-based convolutional layer whose number of trainable parameters grows only logarithmically with the receptive-field size, and demonstrates improved performance in image classification and robustness tasks.
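To make the wavelet idea concrete, here is a minimal sketch of a single-level 2D Haar transform and its inverse in NumPy. The choice of the Haar basis, the function names `haar_dwt2` and `haar_idwt2`, and the subband naming are assumptions made for illustration; this is not taken from the authors' repository. The point it shows is that the low-frequency subband is a half-resolution copy of the input, so a small kernel applied there covers roughly twice the spatial extent in the original image.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar transform of an array with even height and width."""
    a = x[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0  # low-pass in both directions (half-resolution image)
    lh = (a - b + c - d) / 2.0  # difference across columns (naming conventions vary)
    hl = (a + b - c - d) / 2.0  # difference across rows
    hh = (a - b - c + d) / 2.0  # diagonal difference
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuild the full-resolution array from four subbands."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x

image = np.arange(16.0).reshape(4, 4)
ll, lh, hl, hh = haar_dwt2(image)
reconstructed = haar_idwt2(ll, lh, hl, hh)
```

Because the transform is invertible, filtering the subbands and transforming back loses no information from the decomposition itself; repeating the transform on `ll` gives the deeper levels that a cascaded scheme would use.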

In recent years, there have been attempts to increase the kernel size of Convolutional Neural Nets (CNNs) to mimic the global receptive field of Vision Transformers' (ViTs) self-attention blocks. That approach, however, quickly hit an upper bound and saturated way before achieving a global receptive field. In this work, we demonstrate that by leveraging the Wavelet Transform (WT), it is, in fact, possible to obtain very large receptive fields without suffering from over-parameterization, e.g., for a $k \times k$ receptive field, the number of trainable parameters in the proposed method grows only logarithmically with $k$. The proposed layer, named WTConv, can be used as a drop-in replacement in existing architectures, results in an effective multi-frequency response, and scales gracefully with the size of the receptive field. We demonstrate the effectiveness of the WTConv layer within ConvNeXt and MobileNetV2 architectures for image classification, as well as backbones for downstream tasks, and show it yields additional properties such as robustness to image corruption and an increased response to shapes over textures. Our code is available at https://github.com/BGU-CS-VIL/WTConv.
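The logarithmic growth claimed in the abstract can be illustrated with a back-of-the-envelope parameter count. The sketch below assumes a simplified layout that the abstract does not spell out: each wavelet level applies one small depthwise kernel to each of four subbands per channel, and each level doubles the receptive field. The function names are hypothetical, and any extra parameters in the real layer (for example a base convolution or scaling terms) are ignored.

```python
def wtconv_params(channels, kernel_size, levels):
    """Parameter count under the assumed layout: 4 subbands per level, one k x k kernel each."""
    return levels * 4 * channels * kernel_size ** 2

def wtconv_receptive_field(kernel_size, levels):
    """Receptive field under the assumption that each level doubles the spatial extent."""
    return (2 ** levels) * kernel_size

def dense_depthwise_params(channels, receptive_field):
    """Parameters of a single depthwise kernel that covers the same receptive field directly."""
    return channels * receptive_field ** 2

channels, kernel_size = 64, 5
rows = []
for levels in range(1, 6):
    rf = wtconv_receptive_field(kernel_size, levels)
    rows.append((levels, rf,
                 wtconv_params(channels, kernel_size, levels),
                 dense_depthwise_params(channels, rf)))

for levels, rf, wt, dense in rows:
    print(f"levels={levels} receptive_field={rf} wavelet_params={wt} dense_params={dense}")
```

Under these assumptions the wavelet-based count grows linearly in the number of levels, which is logarithmic in the receptive-field size, while the dense kernel grows quadratically.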

Code Implementations: 1 repo
