CVFeb 27, 2025

OverLoCK: An Overview-first-Look-Closely-next ConvNet with Context-Mixing Dynamic Kernels

arXiv:2502.20087v324.872 citationsh-index: 5Has CodeCVPR

Originality Highly original

AI Analysis

This work addresses a key limitation in computer vision architectures by mimicking human visual attention, offering improvements in tasks like image classification, object detection, and semantic segmentation, though it is incremental in advancing ConvNet design.

The paper tackles the problem of modern ConvNets lacking a biomimetic top-down attention mechanism by introducing OverLoCK, a pure ConvNet backbone with a branched architecture that incorporates overview-first and look-closely-next principles, achieving significant performance gains such as 84.2% Top-1 accuracy with reduced computational costs.

Top-down attention plays a crucial role in the human vision system, wherein the brain initially obtains a rough overview of a scene to discover salient cues (i.e., overview first), followed by a more careful finer-grained examination (i.e., look closely next). However, modern ConvNets remain confined to a pyramid structure that successively downsamples the feature map for receptive field expansion, neglecting this crucial biomimetic principle. We present OverLoCK, the first pure ConvNet backbone architecture that explicitly incorporates a top-down attention mechanism. Unlike pyramid backbone networks, our design features a branched architecture with three synergistic sub-networks: 1) a Base-Net that encodes low/mid-level features; 2) a lightweight Overview-Net that generates dynamic top-down attention through coarse global context modeling (i.e., overview first); and 3) a robust Focus-Net that performs finer-grained perception guided by top-down attention (i.e., look closely next). To fully unleash the power of top-down attention, we further propose a novel context-mixing dynamic convolution (ContMix) that effectively models long-range dependencies while preserving inherent local inductive biases even when the input resolution increases, addressing critical limitations in existing convolutions. Our OverLoCK exhibits a notable performance improvement over existing methods. For instance, OverLoCK-T achieves a Top-1 accuracy of 84.2%, significantly surpassing ConvNeXt-B while using only around one-third of the FLOPs/parameters. On object detection, our OverLoCK-S clearly surpasses MogaNet-B by 1% in AP^b. On semantic segmentation, our OverLoCK-T remarkably improves UniRepLKNet-T by 1.7% in mIoU. Code is publicly available at https://github.com/LMMMEng/OverLoCK.

View on arXiv PDF Code

Similar