2.0CVApr 27
Selective Attention-Based Network for Robust Infrared Small Target DetectionYingming Zhang, Wuqi Su, Qing Xiao et al.
Infrared small target detection (IRSTD) plays a pivotal role in a broad spectrum of mission-critical applications, including maritime surveillance, military search and rescue, early warning systems, and precision-guided strikes, all of which demand the precise identification of dim, sub-pixel targets amid highly cluttered infrared backgrounds. Despite significant progress driven by deep learning methods, fundamental challenges persist: infrared small targets occupy extremely limited spatial extents (often only a few pixels), exhibit low signal-to-clutter ratios, and are easily confused with structurally complex backgrounds that frequently induce false alarms. Existing encoder-decoder architectures suffer from two key limitations - an information bottleneck in early convolutional stages that undermines fine-grained target perception, and static skip connections that lack the dynamic adaptability required to discriminate between genuine targets and pseudo-target regions. To address these challenges, we propose SANet, a Selective Attention-based Network built upon the classical U-Net framework and augmented with two novel components: (1) a \emph{Dual-path Semantic-aware Module} (DSM) that integrates standard convolutions for local spatial detail preservation with pinwheel-shaped convolutions for expanded, direction-sensitive receptive fields, followed by a Convolutional Block Attention Module (CBAM) for fine-grained spatial-channel feature recalibration; and (2) a \emph{Selective Attention Fusion Module} (SAFM) that replaces conventional static skip connections with a spatially adaptive, learnable weighting mechanism to perform context-aware, cross-scale feature fusion.
11.2CVApr 27
Multi-Branch Non-Homogeneous Image Dehazing via Concentration Partitioning and Image FusionYingming Zhang, Wuqi Su, Qing Xiao et al.
Existing single image dehazing methods have demonstrated satisfactory performance on homogeneous thin-haze images; however, they often struggle with non-homogeneous hazy images that exhibit spatially varying haze concentrations and abrupt density transitions across different regions. To address this fundamental limitation, we propose a novel multi-branch deep neural network framework, termed Concentration Partitioning and Image Fusion Network (CPIFNet), which decomposes the challenging non-homogeneous dehazing problem into a set of tractable homogeneous sub-problems. Our key insight is that a single non-homogeneous hazy image can be viewed as a composite of multiple local regions, each exhibiting approximately homogeneous haze characteristics. CPIFNet employs a two-stage architecture consisting of an Image Enhancement Network (IENet) stage and an Image Fusion Network (IFNet) stage. In the first stage, multiple IENet branches are independently trained on homogeneous haze datasets of different concentration levels, producing enhancement models that excel at restoring regions matching their respective haze densities. In the second stage, the IFNet intelligently aggregates the advantageous regions from all enhancement outputs through deep feature stacking and merging, yielding a unified high-quality dehazed result. Furthermore, we introduce a comprehensive loss function incorporating reconstruction, perceptual, structural, and color losses to jointly supervise both stages.
59.8CVApr 3
Hierarchical Awareness Adapters with Hybrid Pyramid Feature Fusion for Dense Depth PredictionWuqi Su, Huilun Song, Chen Zhao et al.
Monocular depth estimation from a single RGB image remains a fundamental challenge in computer vision due to inherent scale ambiguity and the absence of explicit geometric cues. Existing approaches typically rely on increasingly complex network architectures to regress depth maps, which escalates training costs and computational overhead without fully exploiting inter-pixel spatial dependencies. We propose a multilevel perceptual conditional random field (CRF) model built upon the Swin Transformer backbone that addresses these limitations through three synergistic innovations: (1) an adaptive hybrid pyramid feature fusion (HPF) strategy that captures both short-range and long-range dependencies by combining multi-scale spatial pyramid pooling with biaxial feature aggregation, enabling effective integration of global and local contextual information; (2) a hierarchical awareness adapter (HA) that enriches cross-level feature interactions within the encoder through lightweight broadcast modules with learnable dimensional scaling, reducing computational complexity while enhancing representational capacity; and (3) a fully-connected CRF decoder with dynamic scaling attention that models fine-grained pixel-level spatial relationships, incorporating a bias learning unit to prevent extreme-value collapse and ensure stable training. Extensive experiments on NYU Depth v2, KITTI, and MatterPort3D datasets demonstrate that our method achieves state-of-the-art performance, reducing Abs Rel to 0.088 ($-$7.4\%) and RMSE to 0.316 ($-$5.4\%) on NYU Depth v2, while attaining near-perfect threshold accuracy ($δ< 1.25^3 \approx 99.8\%$) on KITTI with only 194M parameters and 21ms inference time.