CVDec 16, 2022Code
One-Stage Cascade Refinement Networks for Infrared Small Target DetectionYimian Dai, Xiang Li, Fei Zhou et al.
Single-frame InfraRed Small Target (SIRST) detection has been a challenging task due to a lack of inherent characteristics, imprecise bounding box regression, a scarcity of real-world datasets, and sensitive localization evaluation. In this paper, we propose a comprehensive solution to these challenges. First, we find that the existing anchor-free label assignment method is prone to mislabeling small targets as background, leading to their omission by detectors. To overcome this issue, we propose an all-scale pseudo-box-based label assignment scheme that relaxes the constraints on scale and decouples the spatial assignment from the size of the ground-truth target. Second, motivated by the structured prior of feature pyramids, we introduce the one-stage cascade refinement network (OSCAR), which uses the high-level head as soft proposals for the low-level refinement head. This allows OSCAR to process the same target in a cascade coarse-to-fine manner. Finally, we present a new research benchmark for infrared small target detection, consisting of the SIRST-V2 dataset of real-world, high-resolution single-frame targets, the normalized contrast evaluation metric, and the DeepInfrared toolkit for detection. We conduct extensive ablation studies to evaluate the components of OSCAR and compare its performance to state-of-the-art model-driven and data-driven methods on the SIRST-V2 benchmark. Our results demonstrate that a top-down cascade refinement framework can improve the accuracy of infrared small target detection without sacrificing efficiency. The DeepInfrared toolkit, dataset, and trained models are available at https://github.com/YimianDai/open-deepinfrared to advance further research in this field.
CVAug 7, 2024Code
Pick of the Bunch: Detecting Infrared Small Targets Beyond Hit-Miss Trade-Offs via Selective Rank-Aware AttentionYimian Dai, Peiwen Pan, Yulei Qian et al.
Infrared small target detection faces the inherent challenge of precisely localizing dim targets amidst complex background clutter. Traditional approaches struggle to balance detection precision and false alarm rates. To break this dilemma, we propose SeRankDet, a deep network that achieves high accuracy beyond the conventional hit-miss trade-off, by following the ``Pick of the Bunch'' principle. At its core lies our Selective Rank-Aware Attention (SeRank) module, employing a non-linear Top-K selection process that preserves the most salient responses, preventing target signal dilution while maintaining constant complexity. Furthermore, we replace the static concatenation typical in U-Net structures with our Large Selective Feature Fusion (LSFF) module, a dynamic fusion strategy that empowers SeRankDet with adaptive feature integration, enhancing its ability to discriminate true targets from false alarms. The network's discernment is further refined by our Dilated Difference Convolution (DDC) module, which merges differential convolution aimed at amplifying subtle target characteristics with dilated convolution to expand the receptive field, thereby substantially improving target-background separation. Despite its lightweight architecture, the proposed SeRankDet sets new benchmarks in state-of-the-art performance across multiple public datasets. The code is available at https://github.com/GrokCV/SeRankDet.
CVSep 30, 2024Code
HazyDet: Open-Source Benchmark for Drone-View Object Detection with Depth-Cues in Hazy ScenesChangfeng Feng, Zhenyuan Chen, Xiang Li et al.
Object detection from aerial platforms under adverse atmospheric conditions, particularly haze, is paramount for robust drone autonomy. Yet, this domain remains largely underexplored, primarily hindered by the absence of specialized benchmarks. To bridge this gap, we present \textit{HazyDet}, the first, large-scale benchmark specifically designed for drone-view object detection in hazy conditions. Comprising 383,000 real-world instances derived from both naturally hazy captures and synthetically hazed scenes augmented from clear images, HazyDet provides a challenging and realistic testbed for advancing detection algorithms. To address the severe visual degradation induced by haze, we propose the Depth-Conditioned Detector (DeCoDet), a novel architecture that integrates a Depth-Conditioned Kernel to dynamically modulate feature representations based on depth cues. The practical efficacy and robustness of DeCoDet are further enhanced by its training with a Progressive Domain Fine-Tuning (PDFT) strategy to navigate synthetic-to-real domain shifts, and a Scale-Invariant Refurbishment Loss (SIRLoss) to ensure resilient learning from potentially noisy depth annotations. Comprehensive empirical validation on HazyDet substantiates the superiority of our unified DeCoDet framework, which achieves state-of-the-art performance, surpassing the closest competitor by a notable +1.5\% mAP on challenging real-world hazy test scenarios. Our dataset and toolkit are available at https://github.com/GrokCV/HazyDet.
CVJul 29, 2024Code
Background Semantics Matter: Cross-Task Feature Exchange Network for Clustered Infrared Small Target DetectionMengxuan Xiao, Yinfei Zhu, Yiming Zhu et al.
Infrared small target detection presents significant challenges due to the limited intrinsic features of the target and the overwhelming presence of visually similar background distractors. We contend that background semantics are critical for distinguishing between objects that appear visually similar in this context. To address this challenge, we propose a task, clustered infrared small target detection, and introduce DenseSIRST, a benchmark dataset that provides per-pixel semantic annotations for background regions. This dataset facilitates the shift from sparse to dense target detection. This dataset facilitates the shift from sparse to dense target detection. Building on this resource, we propose the Background-Aware Feature Exchange Network (BAFE-Net), a multi-task architecture that jointly tackles target detection and background semantic segmentation. BAFE-Net incorporates a dynamic cross-task feature hard-exchange mechanism, enabling the effective exchange of target and background semantics between the two tasks. Comprehensive experiments demonstrate that BAFE-Net significantly enhances target detection accuracy while mitigating false alarms. The DenseSIRST dataset, along with the code and trained models, is publicly available at https://github.com/GrokCV/BAFE-Net.
CVJul 22, 2024Code
Sparse Prior Is Not All You Need: When Differential Directionality Meets Saliency Coherence for Infrared Small Target DetectionFei Zhou, Maixia Fu, Yulei Qian et al.
Infrared small target detection is crucial for the efficacy of infrared search and tracking systems. Current tensor decomposition methods emphasize representing small targets with sparsity but struggle to separate targets from complex backgrounds due to insufficient use of intrinsic directional information and reduced target visibility during decomposition. To address these challenges, this study introduces a Sparse Differential Directionality prior (SDD) framework. SDD leverages the distinct directional characteristics of targets to differentiate them from the background, applying mixed sparse constraints on the differential directional images and continuity difference matrix of the temporal component, both derived from Tucker decomposition. We further enhance target detectability with a saliency coherence strategy that intensifies target contrast against the background during hierarchical decomposition. A Proximal Alternating Minimization-based (PAM) algorithm efficiently solves our proposed model. Experimental results on several real-world datasets validate our method's effectiveness, outperforming ten state-of-the-art methods in target detection and clutter suppression. Our code is available at https://github.com/GrokCV/SDD.
69.0CVMay 26Code
RoadGIE: Towards A Global-Scale Aerial Benchmark for Generalizable Interactive Road ExtractionChenxu Peng, Chenxu Wang, Yimian Dai et al.
Accurate road segmentation from aerial imagery is fundamental to many geospatial applications. However, existing datasets often suffer from limited scene diversity, low semantic granularity, and poor structural continuity, restricting their generalization across environments. To address these challenges, we introduce WorldRoadSeg-360K, the largest and most diverse road segmentation dataset to date, comprising 366,947 high-resolution images collected from 38 countries and 223 cities across various terrains and continents. WorldRoadSeg-360K serves as a comprehensive benchmark and reveals key challenges in handling diverse and structurally complex scenes. Automated approaches often struggle to preserve road connectivity, while current interactive methods lack efficient, topology-sensitive tools for real-world road editing. To this end, we present RoadGIE, establishing a novel interactive paradigm for road extraction in remote sensing. Unlike prior point- or box-based prompting strategies, RoadGIE supports connectivity-aware prompts, including clicks and scribbles, which inherently align with the topology of road networks. To improve structural consistency and mitigate performance degradation during iterative interactions, RoadGIE integrates an expert-guided prompting strategy and adapts the skeleton-based recall loss for interactive scenarios. RoadGIE achieves state-of-the-art performance in both segmentation accuracy and topological consistency on WorldRoadSeg-360K and other benchmarks, while maintaining efficient operation with only 3.7M parameters. The code are publicly available at: https://github.com/chaineypung/RoadGIE
CVJul 29, 2024Code
Towards Robust Infrared Small Target Detection: A Feature-Enhanced and Sensitivity-Tunable FrameworkJinmiao Zhao, Zelin Shi, Chuang Yu et al.
Recently, single-frame infrared small target (SIRST) detection technology has attracted widespread attention. Different from most existing deep learning-based methods that focus on improving network architectures, we propose a feature-enhanced and sensitivity-tunable (FEST) framework, which is compatible with existing SIRST detection networks and further enhances their detection performance. The FEST framework improves the model's robustness from two aspects: feature enhancement and target confidence regulation. For feature enhancement, we employ a multi-scale fusion strategy to improve the model's perception to multi-scale features of multi-size targets, and design an edge enhancement difficulty mining (EEDM) loss to guide the network to continuously focus on challenging target regions and edge features during training. For target confidence regulation, an adjustable sensitivity (AS) strategy is proposed for network post-processing. This strategy enhances the model's adaptability in complex scenarios and significantly improves the detection rate of infrared small targets while maintaining segmentation accuracy. Extensive experimental results show that our FEST framework can effectively enhance the performance of existing SIRST detection networks. The code is available at https://github.com/YuChuang1205/FEST-Framework
CVSep 30, 2024Code
MoCoLSK: Modality Conditioned High-Resolution Downscaling for Land Surface TemperatureQun Dai, Chunyang Yuan, Yimian Dai et al.
Land Surface Temperature (LST) is a critical parameter for environmental studies, but directly obtaining high spatial resolution LST data remains challenging due to the spatio-temporal trade-off in satellite remote sensing. Guided LST downscaling has emerged as an alternative solution to overcome these limitations, but current methods often neglect spatial non-stationarity, and there is a lack of an open-source ecosystem for deep learning methods. In this paper, we propose the Modality-Conditional Large Selective Kernel (MoCoLSK) Network, a novel architecture that dynamically fuses multi-modal data through modality-conditioned projections. MoCoLSK achieves a confluence of dynamic receptive field adjustment and multi-modal feature fusion, leading to enhanced LST prediction accuracy. Furthermore, we establish the GrokLST project, a comprehensive open-source ecosystem featuring the GrokLST dataset, a high-resolution benchmark, and the GrokLST toolkit, an open-source PyTorch-based toolkit encapsulating MoCoLSK alongside 40+ state-of-the-art approaches. Extensive experimental results validate MoCoLSK's effectiveness in capturing complex dependencies and subtle variations within multispectral data, outperforming existing methods in LST downscaling. Our code, dataset, and toolkit are available at https://github.com/GrokCV/GrokLST.
CVMar 18, 2024Code
LSKNet: A Foundation Lightweight Backbone for Remote SensingYuxuan Li, Xiang Li, Yimian Dai et al.
Remote sensing images pose distinct challenges for downstream tasks due to their inherent complexity. While a considerable amount of research has been dedicated to remote sensing classification, object detection and semantic segmentation, most of these studies have overlooked the valuable prior knowledge embedded within remote sensing scenarios. Such prior knowledge can be useful because remote sensing objects may be mistakenly recognized without referencing a sufficiently long-range context, which can vary for different objects. This paper considers these priors and proposes a lightweight Large Selective Kernel Network (LSKNet) backbone. LSKNet can dynamically adjust its large spatial receptive field to better model the ranging context of various objects in remote sensing scenarios. To our knowledge, large and selective kernel mechanisms have not been previously explored in remote sensing images. Without bells and whistles, our lightweight LSKNet sets new state-of-the-art scores on standard remote sensing classification, object detection and semantic segmentation benchmarks. Our comprehensive analysis further validated the significance of the identified priors and the effectiveness of LSKNet. The code is available at https://github.com/zcablii/LSKNet.
CVDec 30, 2024Code
SM3Det: A Unified Model for Multi-Modal Remote Sensing Object DetectionYuxuan Li, Xiang Li, Yunheng Li et al.
With the rapid advancement of remote sensing technology, high-resolution multi-modal imagery is now more widely accessible. Conventional Object detection models are trained on a single dataset, often restricted to a specific imaging modality and annotation format. However, such an approach overlooks the valuable shared knowledge across multi-modalities and limits the model's applicability in more versatile scenarios. This paper introduces a new task called Multi-Modal Datasets and Multi-Task Object Detection (M2Det) for remote sensing, designed to accurately detect horizontal or oriented objects from any sensor modality. This task poses challenges due to 1) the trade-offs involved in managing multi-modal modelling and 2) the complexities of multi-task optimization. To address these, we establish a benchmark dataset and propose a unified model, SM3Det (Single Model for Multi-Modal datasets and Multi-Task object Detection). SM3Det leverages a grid-level sparse MoE backbone to enable joint knowledge learning while preserving distinct feature representations for different modalities. Furthermore, it integrates a consistency and synchronization optimization strategy using dynamic learning rate adjustment, allowing it to effectively handle varying levels of learning difficulty across modalities and tasks. Extensive experiments demonstrate SM3Det's effectiveness and generalizability, consistently outperforming specialized models on individual datasets. The code is available at https://github.com/zcablii/SM3Det.
CVJul 13, 2025Code
DRPCA-Net: Make Robust PCA Great Again for Infrared Small Target DetectionZihao Xiong, Fei Zhou, Fengyi Wu et al.
Infrared small target detection plays a vital role in remote sensing, industrial monitoring, and various civilian applications. Despite recent progress powered by deep learning, many end-to-end convolutional models tend to pursue performance by stacking increasingly complex architectures, often at the expense of interpretability, parameter efficiency, and generalization. These models typically overlook the intrinsic sparsity prior of infrared small targets--an essential cue that can be explicitly modeled for both performance and efficiency gains. To address this, we revisit the model-based paradigm of Robust Principal Component Analysis (RPCA) and propose Dynamic RPCA Network (DRPCA-Net), a novel deep unfolding network that integrates the sparsity-aware prior into a learnable architecture. Unlike conventional deep unfolding methods that rely on static, globally learned parameters, DRPCA-Net introduces a dynamic unfolding mechanism via a lightweight hypernetwork. This design enables the model to adaptively generate iteration-wise parameters conditioned on the input scene, thereby enhancing its robustness and generalization across diverse backgrounds. Furthermore, we design a Dynamic Residual Group (DRG) module to better capture contextual variations within the background, leading to more accurate low-rank estimation and improved separation of small targets. Extensive experiments on multiple public infrared datasets demonstrate that DRPCA-Net significantly outperforms existing state-of-the-art methods in detection accuracy. Code is available at https://github.com/GrokCV/DRPCA-Net.
CVSep 22, 2025Code
Visual Instruction Pretraining for Domain-Specific Foundation ModelsYuxuan Li, Yicheng Zhang, Wenhao Tang et al.
Modern computer vision is converging on a closed loop in which perception, reasoning and generation mutually reinforce each other. However, this loop remains incomplete: the top-down influence of high-level reasoning on the foundational learning of low-level perceptual features is not yet underexplored. This paper addresses this gap by proposing a new paradigm for pretraining foundation models in downstream domains. We introduce Visual insTruction Pretraining (ViTP), a novel approach that directly leverages reasoning to enhance perception. ViTP embeds a Vision Transformer (ViT) backbone within a Vision-Language Model and pretrains it end-to-end using a rich corpus of visual instruction data curated from target downstream domains. ViTP is powered by our proposed Visual Robustness Learning (VRL), which compels the ViT to learn robust and domain-relevant features from a sparse set of visual tokens. Extensive experiments on 16 challenging remote sensing and medical imaging benchmarks demonstrate that ViTP establishes new state-of-the-art performance across a diverse range of downstream tasks. The code is available at https://github.com/zcablii/ViTP.
CVJun 30, 2025Code
Event-based Tiny Object Detection: A Benchmark Dataset and BaselineNuo Chen, Chao Xiao, Yimian Dai et al.
Small object detection (SOD) in anti-UAV task is a challenging problem due to the small size of UAVs and complex backgrounds. Traditional frame-based cameras struggle to detect small objects in complex environments due to their low frame rates, limited dynamic range, and data redundancy. Event cameras, with microsecond temporal resolution and high dynamic range, provide a more effective solution for SOD. However, existing event-based object detection datasets are limited in scale, feature large targets size, and lack diverse backgrounds, making them unsuitable for SOD benchmarks. In this paper, we introduce a Event-based Small object detection (EVSOD) dataset (namely EV-UAV), the first large-scale, highly diverse benchmark for anti-UAV tasks. It includes 147 sequences with over 2.3 million event-level annotations, featuring extremely small targets (averaging 6.8 $\times$ 5.4 pixels) and diverse scenarios such as urban clutter and extreme lighting conditions. Furthermore, based on the observation that small moving targets form continuous curves in spatiotemporal event point clouds, we propose Event based Sparse Segmentation Network (EV-SpSegNet), a novel baseline for event segmentation in point cloud space, along with a Spatiotemporal Correlation (STC) loss that leverages motion continuity to guide the network in retaining target events. Extensive experiments on the EV-UAV dataset demonstrate the superiority of our method and provide a benchmark for future research in EVSOD. The dataset and code are at https://github.com/ChenYichen9527/Ev-UAV.
CVAug 12, 2025Code
DenoDet V2: Phase-Amplitude Cross Denoising for SAR Object DetectionKang Ni, Minrui Zou, Yuxuan Li et al.
One of the primary challenges in Synthetic Aperture Radar (SAR) object detection lies in the pervasive influence of coherent noise. As a common practice, most existing methods, whether handcrafted approaches or deep learning-based methods, employ the analysis or enhancement of object spatial-domain characteristics to achieve implicit denoising. In this paper, we propose DenoDet V2, which explores a completely novel and different perspective to deconstruct and modulate the features in the transform domain via a carefully designed attention architecture. Compared to DenoDet V1, DenoDet V2 is a major advancement that exploits the complementary nature of amplitude and phase information through a band-wise mutual modulation mechanism, which enables a reciprocal enhancement between phase and amplitude spectra. Extensive experiments on various SAR datasets demonstrate the state-of-the-art performance of DenoDet V2. Notably, DenoDet V2 achieves a significant 0.8\% improvement on SARDet-100K dataset compared to DenoDet V1, while reducing the model complexity by half. The code is available at https://github.com/GrokCV/GrokSAR.
CVJun 15, 2025Code
Probing Deep into Temporal Profile Makes the Infrared Small Target Detector Much BetterRuojing Li, Wei An, Xinyi Ying et al.
Infrared small target (IRST) detection is challenging in simultaneously achieving precise, universal, robust and efficient performance due to extremely dim targets and strong interference. Current learning-based methods attempt to leverage ``more" information from both the spatial and the short-term temporal domains, but suffer from unreliable performance under complex conditions while incurring computational redundancy. In this paper, we explore the ``more essential" information from a more crucial domain for the detection. Through theoretical analysis, we reveal that the global temporal saliency and correlation information in the temporal profile demonstrate significant superiority in distinguishing target signals from other signals. To investigate whether such superiority is preferentially leveraged by well-trained networks, we built the first prediction attribution tool in this field and verified the importance of the temporal profile information. Inspired by the above conclusions, we remodel the IRST detection task as a one-dimensional signal anomaly detection task, and propose an efficient deep temporal probe network (DeepPro) that only performs calculations in the time dimension for IRST detection. We conducted extensive experiments to fully validate the effectiveness of our method. The experimental results are exciting, as our DeepPro outperforms existing state-of-the-art IRST detection methods on widely-used benchmarks with extremely high efficiency, and achieves a significant improvement on dim targets and in complex scenarios. We provide a new modeling domain, a new insight, a new method, and a new performance, which can promote the development of IRST detection. Codes are available at https://github.com/TinaLRJ/DeepPro.
CVMay 21, 2025Code
AuxDet: Auxiliary Metadata Matters for Omni-Domain Infrared Small Target DetectionYangting Shi, Yinfei Zhu, Renjie He et al.
Omni-domain infrared small target detection (Omni-IRSTD) poses formidable challenges, as a single model must seamlessly adapt to diverse imaging systems, varying resolutions, and multiple spectral bands simultaneously. Current approaches predominantly rely on visual-only modeling paradigms that not only struggle with complex background interference and inherently scarce target features, but also exhibit limited generalization capabilities across complex omni-scene environments where significant domain shifts and appearance variations occur. In this work, we reveal a critical oversight in existing paradigms: the neglect of readily available auxiliary metadata describing imaging parameters and acquisition conditions, such as spectral bands, sensor platforms, resolution, and observation perspectives. To address this limitation, we propose the Auxiliary Metadata Driven Infrared Small Target Detector (AuxDet), a novel multimodal framework that is the first to incorporate metadata into the IRSTD paradigm for scene-aware optimization. Through a high-dimensional fusion module based on multi-layer perceptrons (MLPs), AuxDet dynamically integrates metadata semantics with visual features, guiding adaptive representation learning for each individual sample. Additionally, we design a lightweight prior-initialized enhancement module using 1D convolutional blocks to further refine fused features and recover fine-grained target cues. Extensive experiments on the challenging WideIRSTD-Full benchmark demonstrate that AuxDet consistently outperforms state-of-the-art methods, validating the critical role of auxiliary information in improving robustness and accuracy in omni-domain IRSTD tasks. Code is available at https://github.com/GrokCV/AuxDet.
CVDec 15, 2024Code
From Easy to Hard: Progressive Active Learning Framework for Infrared Small Target Detection with Single Point SupervisionChuang Yu, Jinmiao Zhao, Yunpeng Liu et al.
Recently, single-frame infrared small target (SIRST) detection with single point supervision has drawn wide-spread attention. However, the latest label evolution with single point supervision (LESPS) framework suffers from instability, excessive label evolution, and difficulty in exerting embedded network performance. Inspired by organisms gradually adapting to their environment and continuously accumulating knowledge, we construct an innovative Progressive Active Learning (PAL) framework for single point supervision, which drives the existing SIRST detection networks progressively and actively recognizes and learns more hard samples to achieve significant performance improvements. Specifically, to avoid the early low-performance model leading to the wrong selection of hard samples, we propose a model pre-start concept, which focuses on automatically selecting a portion of easy samples and helping the model have basic task-specific learning capabilities. Meanwhile, we propose a refined dual-update strategy, which can promote reasonable learning of harder samples and continuous refinement of pseudo-labels. In addition, to alleviate the risk of excessive label evolution, a decay factor is reasonably introduced, which helps to achieve a dynamic balance between the expansion and contraction of target annotations. Extensive experiments show that existing SIRST detection networks equipped with our PAL framework have achieved state-of-the-art (SOTA) results on multiple public datasets. Furthermore, our PAL framework can build an efficient and stable bridge between full supervision and single point supervision tasks. Our code are available at https://github.com/YuChuang1205/PAL.
CVJul 13, 2025Code
SeqCSIST: Sequential Closely-Spaced Infrared Small Target UnmixingXimeng Zhai, Bohan Xu, Yaohong Chen et al.
Due to the limitation of the optical lens focal length and the resolution of the infrared detector, distant Closely-Spaced Infrared Small Target (CSIST) groups typically appear as mixing spots in the infrared image. In this paper, we propose a novel task, Sequential CSIST Unmixing, namely detecting all targets in the form of sub-pixel localization from a highly dense CSIST group. However, achieving such precise detection is an extremely difficult challenge. In addition, the lack of high-quality public datasets has also restricted the research progress. To this end, firstly, we contribute an open-source ecosystem, including SeqCSIST, a sequential benchmark dataset, and a toolkit that provides objective evaluation metrics for this special task, along with the implementation of 23 relevant methods. Furthermore, we propose the Deformable Refinement Network (DeRefNet), a model-driven deep learning framework that introduces a Temporal Deformable Feature Alignment (TDFA) module enabling adaptive inter-frame information aggregation. To the best of our knowledge, this work is the first endeavor to address the CSIST Unmixing task within a multi-frame paradigm. Experiments on the SeqCSIST dataset demonstrate that our method outperforms the state-of-the-art approaches with mean Average Precision (mAP) metric improved by 5.3\%. Our dataset and toolkit are available from https://github.com/GrokCV/SeqCSIST.
CVMay 25, 2025Code
DISTA-Net: Dynamic Closely-Spaced Infrared Small Target UnmixingShengdong Han, Shangdong Yang, Xin Zhang et al.
Resolving closely-spaced small targets in dense clusters presents a significant challenge in infrared imaging, as the overlapping signals hinder precise determination of their quantity, sub-pixel positions, and radiation intensities. While deep learning has advanced the field of infrared small target detection, its application to closely-spaced infrared small targets has not yet been explored. This gap exists primarily due to the complexity of separating superimposed characteristics and the lack of an open-source infrastructure. In this work, we propose the Dynamic Iterative Shrinkage Thresholding Network (DISTA-Net), which reconceptualizes traditional sparse reconstruction within a dynamic framework. DISTA-Net adaptively generates convolution weights and thresholding parameters to tailor the reconstruction process in real time. To the best of our knowledge, DISTA-Net is the first deep learning model designed specifically for the unmixing of closely-spaced infrared small targets, achieving superior sub-pixel detection accuracy. Moreover, we have established the first open-source ecosystem to foster further research in this field. This ecosystem comprises three key components: (1) CSIST-100K, a publicly available benchmark dataset; (2) CSO-mAP, a custom evaluation metric for sub-pixel detection; and (3) GrokCSO, an open-source toolkit featuring DISTA-Net and other models. Our code and dataset are available at https://github.com/GrokCV/GrokCSO.
CVDec 5, 2025Code
Rethinking Infrared Small Target Detection: A Foundation-Driven Efficient ParadigmChuang Yu, Jinmiao Zhao, Yunpeng Liu et al.
While large-scale visual foundation models (VFMs) exhibit strong generalization across diverse visual domains, their potential for single-frame infrared small target (SIRST) detection remains largely unexplored. To fill this gap, we systematically introduce the frozen representations from VFMs into the SIRST task for the first time and propose a Foundation-Driven Efficient Paradigm (FDEP), which can seamlessly adapt to existing encoder-decoder-based methods and significantly improve accuracy without additional inference overhead. Specifically, a Semantic Alignment Modulation Fusion (SAMF) module is designed to achieve dynamic alignment and deep fusion of the global semantic priors from VFMs with task-specific features. Meanwhile, to avoid the inference time burden introduced by VFMs, we propose a Collaborative Optimization-based Implicit Self-Distillation (CO-ISD) strategy, which enables implicit semantic transfer between the main and lightweight branches through parameter sharing and synchronized backpropagation. In addition, to unify the fragmented evaluation system, we construct a Holistic SIRST Evaluation (HSE) metric that performs multi-threshold integral evaluation at both pixel-level confidence and target-level robustness, providing a stable and comprehensive basis for fair model comparison. Extensive experiments demonstrate that the SIRST detection networks equipped with our FDEP framework achieve state-of-the-art (SOTA) performance on multiple public datasets. Our code is available at https://github.com/YuChuang1205/FDEP-Framework
CVJun 5, 2024Code
DenoDet: Attention as Deformable Multi-Subspace Feature Denoising for Target Detection in SAR ImagesYimian Dai, Minrui Zou, Yuxuan Li et al.
Synthetic Aperture Radar (SAR) target detection has long been impeded by inherent speckle noise and the prevalence of diminutive, ambiguous targets. While deep neural networks have advanced SAR target detection, their intrinsic low-frequency bias and static post-training weights falter with coherent noise and preserving subtle details across heterogeneous terrains. Motivated by traditional SAR image denoising, we propose DenoDet, a network aided by explicit frequency domain transform to calibrate convolutional biases and pay more attention to high-frequencies, forming a natural multi-scale subspace representation to detect targets from the perspective of multi-subspace denoising. We design TransDeno, a dynamic frequency domain attention module that performs as a transform domain soft thresholding operation, dynamically denoising across subspaces by preserving salient target signals and attenuating noise. To adaptively adjust the granularity of subspace processing, we also propose a deformable group fully-connected layer (DeGroFC) that dynamically varies the group conditioned on the input features. Without bells and whistles, our plug-and-play TransDeno sets state-of-the-art scores on multiple SAR target detection datasets. The code is available at https://github.com/GrokCV/GrokSAR.
CVJun 4, 2024Code
Multi-Scale Direction-Aware Network for Infrared Small Target DetectionJinmiao Zhao, Zelin Shi, Chuang Yu et al.
Infrared small target detection faces the problem that it is difficult to effectively separate the background and the target. Existing deep learning-based methods focus on edge and shape features, but ignore the richer structural differences and detailed information embedded in high-frequency components from different directions, thereby failing to fully exploit the value of high-frequency directional features in target perception. To address this limitation, we propose a multi-scale direction-aware network (MSDA-Net), which is the first attempt to integrate the high-frequency directional features of infrared small targets as domain prior knowledge into neural networks. Specifically, to fully mine the high-frequency directional features, on the one hand, a high-frequency direction injection (HFDI) module without trainable parameters is constructed to inject the high-frequency directional information of the original image into the network. On the other hand, a multi-scale direction-aware (MSDA) module is constructed, which promotes the full extraction of local relations at different scales and the full perception of key features in different directions. In addition, considering the characteristics of infrared small targets, we construct a feature aggregation (FA) structure to address target disappearance in high-level feature maps, and a feature calibration fusion (FCF) module to alleviate feature bias during cross-layer feature fusion. Extensive experimental results show that our MSDA-Net achieves state-of-the-art (SOTA) results on multiple public datasets. The code can be available at https://github.com/YuChuang1205/MSDA-Net
CVAug 6, 2025
RPCANet++: Deep Interpretable Robust PCA for Sparse Object SegmentationFengyi Wu, Yimian Dai, Tianfang Zhang et al.
Robust principal component analysis (RPCA) decomposes an observation matrix into low-rank background and sparse object components. This capability has enabled its application in tasks ranging from image restoration to segmentation. However, traditional RPCA models suffer from computational burdens caused by matrix operations, reliance on finely tuned hyperparameters, and rigid priors that limit adaptability in dynamic scenarios. To solve these limitations, we propose RPCANet++, a sparse object segmentation framework that fuses the interpretability of RPCA with efficient deep architectures. Our approach unfolds a relaxed RPCA model into a structured network comprising a Background Approximation Module (BAM), an Object Extraction Module (OEM), and an Image Restoration Module (IRM). To mitigate inter-stage transmission loss in the BAM, we introduce a Memory-Augmented Module (MAM) to enhance background feature preservation, while a Deep Contrast Prior Module (DCPM) leverages saliency cues to expedite object extraction. Extensive experiments on diverse datasets demonstrate that RPCANet++ achieves state-of-the-art performance under various imaging scenarios. We further improve interpretability via visual and numerical low-rankness and sparsity measurements. By combining the theoretical strengths of RPCA with the efficiency of deep networks, our approach sets a new baseline for reliable and interpretable sparse object segmentation. Codes are available at our Project Webpage https://fengyiwu98.github.io/rpcanetx.
CVAug 9, 2025
NS-FPN: Improving Infrared Small Target Detection and Segmentation from Noise Suppression PerspectiveMaoxun Yuan, Duanni Meng, Ziteng Xi et al.
Infrared small target detection and segmentation (IRSTDS) is a critical yet challenging task in defense and civilian applications, owing to the dim, shapeless appearance of targets and severe background clutter. Recent CNN-based methods have achieved promising target perception results, but they only focus on enhancing feature representation to offset the impact of noise, which results in the increased false alarms problem. In this paper, through analyzing the problem from the frequency domain, we pioneer in improving performance from noise suppression perspective and propose a novel noise-suppression feature pyramid network (NS-FPN), which integrates a low-frequency guided feature purification (LFP) module and a spiral-aware feature sampling (SFS) module into the original FPN structure. The LFP module suppresses the noise features by purifying high-frequency components to achieve feature enhancement devoid of noise interference, while the SFS module further adopts spiral sampling to fuse target-relevant features in feature fusion process. Our NS-FPN is designed to be lightweight yet effective and can be easily plugged into existing IRSTDS frameworks. Extensive experiments on the public IRSTDS datasets demonstrate that our method significantly reduces false alarms and achieves superior performance on IRSTDS tasks.
CVMay 8, 2025
A Simple Detector with Frame Dynamics is a Strong TrackerChenxu Peng, Chenxu Wang, Minrui Zou et al.
Infrared object tracking plays a crucial role in Anti-Unmanned Aerial Vehicle (Anti-UAV) applications. Existing trackers often depend on cropped template regions and have limited motion modeling capabilities, which pose challenges when dealing with tiny targets. To address this, we propose a simple yet effective infrared tiny-object tracker that enhances tracking performance by integrating global detection and motion-aware learning with temporal priors. Our method is based on object detection and achieves significant improvements through two key innovations. First, we introduce frame dynamics, leveraging frame difference and optical flow to encode both prior target features and motion characteristics at the input level, enabling the model to better distinguish the target from background clutter. Second, we propose a trajectory constraint filtering strategy in the post-processing stage, utilizing spatio-temporal priors to suppress false positives and enhance tracking robustness. Extensive experiments show that our method consistently outperforms existing approaches across multiple metrics in challenging infrared UAV tracking scenarios. Notably, we achieve state-of-the-art performance in the 4th Anti-UAV Challenge, securing 1st place in Track 1 and 2nd place in Track 2.
CVDec 15, 2020
Attentional Local Contrast Networks for Infrared Small Target DetectionYimian Dai, Yiquan Wu, Fei Zhou et al.
To mitigate the issue of minimal intrinsic features for pure data-driven methods, in this paper, we propose a novel model-driven deep network for infrared small target detection, which combines discriminative networks and conventional model-driven methods to make use of both labeled data and the domain knowledge. By designing a feature map cyclic shift scheme, we modularize a conventional local contrast measure method as a depth-wise parameterless nonlinear feature refinement layer in an end-to-end network, which encodes relatively long-range contextual interactions with clear physical interpretability. To highlight and preserve the small target features, we also exploit a bottom-up attentional modulation integrating the smaller scale subtle details of low-level features into high-level features of deeper layers. We conduct detailed ablation studies with varying network depths to empirically verify the effectiveness and efficiency of the design of each component in our network architecture. We also compare the performance of our network against other model-driven methods and deep networks on the open SIRST dataset as well. The results suggest that our network yields a performance boost over its competitors. Our code, trained models, and results are available online.
CVSep 30, 2020
Asymmetric Contextual Modulation for Infrared Small Target DetectionYimian Dai, Yiquan Wu, Fei Zhou et al.
Single-frame infrared small target detection remains a challenge not only due to the scarcity of intrinsic target characteristics but also because of lacking a public dataset. In this paper, we first contribute an open dataset with high-quality annotations to advance the research in this field. We also propose an asymmetric contextual modulation module specially designed for detecting infrared small targets. To better highlight small targets, besides a top-down global contextual feedback, we supplement a bottom-up modulation pathway based on point-wise channel attention for exchanging high-level semantics and subtle low-level details. We report ablation studies and comparisons to state-of-the-art methods, where we find that our approach performs significantly better. Our dataset and code are available online.
CVSep 29, 2020
Attentional Feature FusionYimian Dai, Fabian Gieseke, Stefan Oehmcke et al.
Feature fusion, the combination of features from different layers or branches, is an omnipresent part of modern network architectures. It is often implemented via simple operations, such as summation or concatenation, but this might not be the best choice. In this work, we propose a uniform and general scheme, namely attentional feature fusion, which is applicable for most common scenarios, including feature fusion induced by short and long skip connections as well as within Inception layers. To better fuse features of inconsistent semantics and scales, we propose a multi-scale channel attention module, which addresses issues that arise when fusing features given at different scales. We also demonstrate that the initial integration of feature maps can become a bottleneck and that this issue can be alleviated by adding another level of attention, which we refer to as iterative attentional feature fusion. With fewer layers or parameters, our models outperform state-of-the-art networks on both CIFAR-100 and ImageNet datasets, which suggests that more sophisticated attention mechanisms for feature fusion hold great potential to consistently yield better results compared to their direct counterparts. Our codes and trained models are available online.
CVJul 15, 2020
Attention as ActivationYimian Dai, Stefan Oehmcke, Fabian Gieseke et al.
Activation functions and attention mechanisms are typically treated as having different purposes and have evolved differently. However, both concepts can be formulated as a non-linear gating function. Inspired by their similarity, we propose a novel type of activation units called attentional activation (ATAC) units as a unification of activation functions and attention mechanisms. In particular, we propose a local channel attention module for the simultaneous non-linear activation and element-wise feature refinement, which locally aggregates point-wise cross-channel feature contexts. By replacing the well-known rectified linear units by such ATAC units in convolutional networks, we can construct fully attentional networks that perform significantly better with a modest number of additional parameters. We conducted detailed ablation studies on the ATAC units using several host networks with varying network depths to empirically verify the effectiveness and efficiency of the units. Furthermore, we compared the performance of the ATAC units against existing activation functions as well as other attention mechanisms on the CIFAR-10, CIFAR-100, and ImageNet datasets. Our experimental results show that networks constructed with the proposed ATAC units generally yield performance gains over their competitors given a comparable number of parameters.
CVMar 27, 2017
Reweighted Infrared Patch-Tensor Model With Both Non-Local and Local Priors for Single-Frame Small Target DetectionYimian Dai, Yiquan Wu
Many state-of-the-art methods have been proposed for infrared small target detection. They work well on the images with homogeneous backgrounds and high-contrast targets. However, when facing highly heterogeneous backgrounds, they would not perform very well, mainly due to: 1) the existence of strong edges and other interfering components, 2) not utilizing the priors fully. Inspired by this, we propose a novel method to exploit both local and non-local priors simultaneously. Firstly, we employ a new infrared patch-tensor (IPT) model to represent the image and preserve its spatial correlations. Exploiting the target sparse prior and background non-local self-correlation prior, the target-background separation is modeled as a robust low-rank tensor recovery problem. Moreover, with the help of the structure tensor and reweighted idea, we design an entry-wise local-structure-adaptive and sparsity enhancing weight to replace the globally constant weighting parameter. The decomposition could be achieved via the element-wise reweighted higher-order robust principal component analysis with an additional convergence condition according to the practical situation of target detection. Extensive experiments demonstrate that our model outperforms the other state-of-the-arts, in particular for the images with very dim targets and heavy clutters.