HyenaPixel: Global Image Context with Convolutions
This work addresses the quadratic cost of attention on high-resolution image tasks in computer vision, offering a sub-quadratic, convolution-based alternative with competitive accuracy.
The authors obtain global context in computer vision without the quadratic complexity of attention by extending Hyena convolutions to bidirectional data and two-dimensional image space, scaling kernels up to 191×191 to maximize the effective receptive field. HyenaPixel and bidirectional Hyena reach competitive ImageNet-1k top-1 accuracies of 84.9% and 85.2%, respectively, while outperforming other convolutional networks.
In computer vision, a larger effective receptive field (ERF) is associated with better performance. While attention natively supports global context, its quadratic complexity limits its applicability to tasks that benefit from high-resolution input. In this work, we extend Hyena, a convolution-based attention replacement, from causal sequences to bidirectional data and two-dimensional image space. We scale Hyena's convolution kernels beyond the feature map size, up to 191$\times$191, to maximize ERF while maintaining sub-quadratic complexity in the number of pixels. We integrate our two-dimensional Hyena, HyenaPixel, and bidirectional Hyena into the MetaFormer framework. For image categorization, HyenaPixel and bidirectional Hyena achieve a competitive ImageNet-1k top-1 accuracy of 84.9% and 85.2%, respectively, with no additional training data, while outperforming other convolutional and large-kernel networks. Combining HyenaPixel with attention further improves accuracy. We attribute the success of bidirectional Hyena to learning the data-dependent geometric arrangement of pixels without a fixed neighborhood definition. Experimental results on downstream tasks suggest that HyenaPixel with large filters and a fixed neighborhood leads to better localization performance.
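To make the claim about kernels larger than the feature map concrete, the following is a minimal sketch of a zero-padded 2D convolution computed with the FFT, which costs O(N log N) in the number of pixels N rather than O(N^2). It is not taken from the paper: the abstract does not say how the convolution is implemented, so the FFT route, the function name `fft_conv2d_same`, and the 96$\times$96 feature map (chosen only because 191 = 2·96 − 1) are all assumptions made for illustration. The sketch also omits Hyena's gating and learned filter parameterization.

```python
import numpy as np


def fft_conv2d_same(x, k):
    """Zero-padded 2D convolution via FFT, cropped to the shape of x.

    x: (H, W) single-channel feature map.
    k: (Kh, Kw) kernel with odd side lengths; it may be larger than x.
    Hypothetical helper for illustration, not the authors' implementation.
    """
    H, W = x.shape
    Kh, Kw = k.shape
    # Pad both operands to the full linear-convolution size so that the
    # circular convolution computed by the FFT has no wrap-around.
    fh, fw = H + Kh - 1, W + Kw - 1
    X = np.fft.rfft2(x, s=(fh, fw))
    K = np.fft.rfft2(k, s=(fh, fw))
    full = np.fft.irfft2(X * K, s=(fh, fw))
    # Crop the window that aligns each output pixel with the kernel center.
    top, left = (Kh - 1) // 2, (Kw - 1) // 2
    return full[top:top + H, left:left + W]


# Assumed sizes: a 96x96 feature map and a 191x191 kernel (191 = 2*96 - 1).
rng = np.random.default_rng(0)
x = rng.standard_normal((96, 96))
k = np.ones((191, 191))

# With a kernel of side 2H - 1, every output pixel sees every input pixel,
# so an all-ones kernel returns the global sum of x at each location.
y = fft_conv2d_same(x, k)
```

Under these assumptions, a kernel of side 2H − 1 is the smallest centered kernel whose support covers the whole feature map from any output position, including the corners. This is one way to read the phrase "beyond the feature map size".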