Junwei Han

CV
h-index55
98papers
13,911citations
Novelty49%
AI Score61

98 Papers

CVApr 30, 2023Code
Discriminative Co-Saliency and Background Mining Transformer for Co-Salient Object Detection

Long Li, Junwei Han, Ni Zhang et al.

Most previous co-salient object detection works mainly focus on extracting co-salient cues via mining the consistency relations across images while ignoring explicit exploration of background regions. In this paper, we propose a Discriminative co-saliency and background Mining Transformer framework (DMT) based on several economical multi-grained correlation modules to explicitly mine both co-saliency and background information and effectively model their discrimination. Specifically, we first propose a region-to-region correlation module for introducing inter-image relations to pixel-wise segmentation features while maintaining computational efficiency. Then, we use two types of pre-defined tokens to mine co-saliency and background information via our proposed contrast-induced pixel-to-token correlation and co-saliency token-to-token correlation modules. We also design a token-guided feature refinement module to enhance the discriminability of the segmentation features under the guidance of the learned tokens. We perform iterative mutual promotion for the segmentation feature extraction and token construction. Experimental results on three benchmark datasets demonstrate the effectiveness of our proposed method. The source code is available at: https://github.com/dragonlee258079/DMT.

CVMar 15, 2022Code
Learning What Not to Segment: A New Perspective on Few-Shot Segmentation

Chunbo Lang, Gong Cheng, Binfei Tu et al.

Recently few-shot segmentation (FSS) has been extensively developed. Most previous works strive to achieve generalization through the meta-learning framework derived from classification tasks; however, the trained models are biased towards the seen classes instead of being ideally class-agnostic, thus hindering the recognition of new concepts. This paper proposes a fresh and straightforward insight to alleviate the problem. Specifically, we apply an additional branch (base learner) to the conventional FSS model (meta learner) to explicitly identify the targets of base classes, i.e., the regions that do not need to be segmented. Then, the coarse results output by these two learners in parallel are adaptively integrated to yield precise segmentation prediction. Considering the sensitivity of meta learner, we further introduce an adjustment factor to estimate the scene differences between the input image pairs for facilitating the model ensemble forecasting. The substantial performance gains on PASCAL-5i and COCO-20i verify the effectiveness, and surprisingly, our versatile scheme sets a new state-of-the-art even with two plain learners. Moreover, in light of the unique nature of the proposed approach, we also extend it to a more realistic but challenging setting, i.e., generalized FSS, where the pixels of both base and novel classes are required to be determined. The source code is available at github.com/chunbolang/BAM.

CVNov 25, 2023Code
VSCode: General Visual Salient and Camouflaged Object Detection with 2D Prompt Learning

Ziyang Luo, Nian Liu, Wangbo Zhao et al.

Salient object detection (SOD) and camouflaged object detection (COD) are related yet distinct binary mapping tasks. These tasks involve multiple modalities, sharing commonalities and unique cues. Existing research often employs intricate task-specific specialist models, potentially leading to redundancy and suboptimal results. We introduce VSCode, a generalist model with novel 2D prompt learning, to jointly address four SOD tasks and three COD tasks. We utilize VST as the foundation model and introduce 2D prompts within the encoder-decoder architecture to learn domain and task-specific knowledge on two separate dimensions. A prompt discrimination loss helps disentangle peculiarities to benefit model optimization. VSCode outperforms state-of-the-art methods across six tasks on 26 datasets and exhibits zero-shot generalization to unseen tasks by combining 2D prompts, such as RGB-D COD. Source code has been available at https://github.com/Sssssuperior/VSCode.

CVMar 2, 2022Code
Colar: Effective and Efficient Online Action Detection by Consulting Exemplars

Le Yang, Junwei Han, Dingwen Zhang

Online action detection has attracted increasing research interests in recent years. Current works model historical dependencies and anticipate the future to perceive the action evolution within a video segment and improve the detection accuracy. However, the existing paradigm ignores category-level modeling and does not pay sufficient attention to efficiency. Considering a category, its representative frames exhibit various characteristics. Thus, the category-level modeling can provide complimentary guidance to the temporal dependencies modeling. This paper develops an effective exemplar-consultation mechanism that first measures the similarity between a frame and exemplary frames, and then aggregates exemplary features based on the similarity weights. This is also an efficient mechanism, as both similarity measurement and feature aggregation require limited computations. Based on the exemplar-consultation mechanism, the long-term dependencies can be captured by regarding historical frames as exemplars, while the category-level modeling can be achieved by regarding representative frames from a category as exemplars. Due to the complementarity from the category-level modeling, our method employs a lightweight architecture but achieves new high performance on three benchmarks. In addition, using a spatio-temporal network to tackle video frames, our method makes a good trade-off between effectiveness and efficiency. Code is available at https://github.com/VividLe/Online-Action-Detection.

CVDec 26, 2022Code
Fewer is More: Efficient Object Detection in Large Aerial Images

Xingxing Xie, Gong Cheng, Qingyang Li et al.

Current mainstream object detection methods for large aerial images usually divide large images into patches and then exhaustively detect the objects of interest on all patches, no matter whether there exist objects or not. This paradigm, although effective, is inefficient because the detectors have to go through all patches, severely hindering the inference speed. This paper presents an Objectness Activation Network (OAN) to help detectors focus on fewer patches but achieve more efficient inference and more accurate results, enabling a simple and effective solution to object detection in large images. In brief, OAN is a light fully-convolutional network for judging whether each patch contains objects or not, which can be easily integrated into many object detectors and jointly trained with them end-to-end. We extensively evaluate our OAN with five advanced detectors. Using OAN, all five detectors acquire more than 30.0% speed-up on three large-scale aerial image datasets, meanwhile with consistent accuracy improvements. On extremely large Gaofen-2 images (29200$\times$27620 pixels), our OAN improves the detection speed by 70.5%. Moreover, we extend our OAN to driving-scene object detection and 4K video object detection, boosting the detection speed by 112.1% and 75.0%, respectively, without sacrificing the accuracy. Code is available at https://github.com/Ranchosky/OAN.

CVApr 21, 2022Code
Beyond the Prototype: Divide-and-conquer Proxies for Few-shot Segmentation

Chunbo Lang, Binfei Tu, Gong Cheng et al.

Few-shot segmentation, which aims to segment unseen-class objects given only a handful of densely labeled samples, has received widespread attention from the community. Existing approaches typically follow the prototype learning paradigm to perform meta-inference, which fails to fully exploit the underlying information from support image-mask pairs, resulting in various segmentation failures, e.g., incomplete objects, ambiguous boundaries, and distractor activation. To this end, we propose a simple yet versatile framework in the spirit of divide-and-conquer. Specifically, a novel self-reasoning scheme is first implemented on the annotated support image, and then the coarse segmentation mask is divided into multiple regions with different properties. Leveraging effective masked average pooling operations, a series of support-induced proxies are thus derived, each playing a specific role in conquering the above challenges. Moreover, we devise a unique parallel decoder structure that integrates proxies with similar attributes to boost the discrimination power. Our proposed approach, named divide-and-conquer proxies (DCP), allows for the development of appropriate and reliable information as a guide at the "episode" level, not just about the object cues themselves. Extensive experiments on PASCAL-5i and COCO-20i demonstrate the superiority of DCP over conventional prototype-based approaches (up to 5~10% on average), which also establishes a new state-of-the-art. Code is available at github.com/chunbolang/DCP.

CVOct 13, 2022Code
Intermediate Prototype Mining Transformer for Few-Shot Semantic Segmentation

Yuanwei Liu, Nian Liu, Xiwen Yao et al.

Few-shot semantic segmentation aims to segment the target objects in query under the condition of a few annotated support images. Most previous works strive to mine more effective category information from the support to match with the corresponding objects in query. However, they all ignored the category information gap between query and support images. If the objects in them show large intra-class diversity, forcibly migrating the category information from the support to the query is ineffective. To solve this problem, we are the first to introduce an intermediate prototype for mining both deterministic category information from the support and adaptive category knowledge from the query. Specifically, we design an Intermediate Prototype Mining Transformer (IPMT) to learn the prototype in an iterative way. In each IPMT layer, we propagate the object information in both support and query features to the prototype and then use it to activate the query feature map. By conducting this process iteratively, both the intermediate prototype and the query feature can be progressively improved. At last, the final query feature is used to yield precise segmentation prediction. Extensive experiments on both PASCAL-5i and COCO-20i datasets clearly verify the effectiveness of our IPMT and show that it outperforms previous state-of-the-art methods by a large margin. Code is available at https://github.com/LIUYUANWEI98/IPMT

CVMay 20, 2022Code
Structured Attention Composition for Temporal Action Localization

Le Yang, Junwei Han, Tao Zhao et al.

Temporal action localization aims at localizing action instances from untrimmed videos. Existing works have designed various effective modules to precisely localize action instances based on appearance and motion features. However, by treating these two kinds of features with equal importance, previous works cannot take full advantage of each modality feature, making the learned model still sub-optimal. To tackle this issue, we make an early effort to study temporal action localization from the perspective of multi-modality feature learning, based on the observation that different actions exhibit specific preferences to appearance or motion modality. Specifically, we build a novel structured attention composition module. Unlike conventional attention, the proposed module would not infer frame attention and modality attention independently. Instead, by casting the relationship between the modality attention and the frame attention as an attention assignment process, the structured attention composition module learns to encode the frame-modality structure and uses it to regularize the inferred frame attention and modality attention, respectively, upon the optimal transport theory. The final frame-modality attention is obtained by the composition of the two individual attentions. The proposed structured attention composition module can be deployed as a plug-and-play module into existing action localization frameworks. Extensive experiments on two widely used benchmarks show that the proposed structured attention composition consistently improves four state-of-the-art temporal action localization methods and builds new state-of-the-art performance on THUMOS14. Code is availabel at https://github.com/VividLe/Structured-Attention-Composition.

CVSep 20, 2023Code
Multi-grained Temporal Prototype Learning for Few-shot Video Object Segmentation

Nian Liu, Kepan Nan, Wangbo Zhao et al.

Few-Shot Video Object Segmentation (FSVOS) aims to segment objects in a query video with the same category defined by a few annotated support images. However, this task was seldom explored. In this work, based on IPMT, a state-of-the-art few-shot image segmentation method that combines external support guidance information with adaptive query guidance cues, we propose to leverage multi-grained temporal guidance information for handling the temporal correlation nature of video data. We decompose the query video information into a clip prototype and a memory prototype for capturing local and long-term internal temporal guidance, respectively. Frame prototypes are further used for each frame independently to handle fine-grained adaptive guidance and enable bidirectional clip-frame prototype communication. To reduce the influence of noisy memory, we propose to leverage the structural similarity relation among different predicted regions and the support for selecting reliable memory frames. Furthermore, a new segmentation loss is also proposed to enhance the category discriminability of the learned prototypes. Experimental results demonstrate that our proposed video IPMT model significantly outperforms previous models on two benchmark datasets. Code is available at https://github.com/nankepan/VIPMT.

CVMay 10, 2022
Learning Non-target Knowledge for Few-shot Semantic Segmentation

Yuanwei Liu, Nian Liu, Qinglong Cao et al.

Existing studies in few-shot semantic segmentation only focus on mining the target object information, however, often are hard to tell ambiguous regions, especially in non-target regions, which include background (BG) and Distracting Objects (DOs). To alleviate this problem, we propose a novel framework, namely Non-Target Region Eliminating (NTRE) network, to explicitly mine and eliminate BG and DO regions in the query. First, a BG Mining Module (BGMM) is proposed to extract the BG region via learning a general BG prototype. To this end, we design a BG loss to supervise the learning of BGMM only using the known target object segmentation ground truth. Then, a BG Eliminating Module and a DO Eliminating Module are proposed to successively filter out the BG and DO information from the query feature, based on which we can obtain a BG and DO-free target object segmentation result. Furthermore, we propose a prototypical contrastive learning algorithm to improve the model ability of distinguishing the target object from DOs. Extensive experiments on both PASCAL-5i and COCO-20i datasets show that our approach is effective despite its simplicity.

CVMar 29, 2022
Cross-Modality High-Frequency Transformer for MR Image Super-Resolution

Chaowei Fang, Dingwen Zhang, Liang Wang et al. · eth-zurich

Improving the resolution of magnetic resonance (MR) image data is critical to computer-aided diagnosis and brain function analysis. Higher resolution helps to capture more detailed content, but typically induces to lower signal-to-noise ratio and longer scanning time. To this end, MR image super-resolution has become a widely-interested topic in recent times. Existing works establish extensive deep models with the conventional architectures based on convolutional neural networks (CNN). In this work, to further advance this research field, we make an early effort to build a Transformer-based MR image super-resolution framework, with careful designs on exploring valuable domain prior knowledge. Specifically, we consider two-fold domain priors including the high-frequency structure prior and the inter-modality context prior, and establish a novel Transformer architecture, called Cross-modality high-frequency Transformer (Cohf-T), to introduce such priors into super-resolving the low-resolution (LR) MR images. Experiments on two datasets indicate that Cohf-T achieves new state-of-the-art performance.

CVJul 28, 2022
Towards Large-Scale Small Object Detection: Survey and Benchmarks

Gong Cheng, Xiang Yuan, Xiwen Yao et al.

With the rise of deep convolutional neural networks, object detection has achieved prominent advances in past years. However, such prosperity could not camouflage the unsatisfactory situation of Small Object Detection (SOD), one of the notoriously challenging tasks in computer vision, owing to the poor visual appearance and noisy representation caused by the intrinsic structure of small targets. In addition, large-scale dataset for benchmarking small object detection methods remains a bottleneck. In this paper, we first conduct a thorough review of small object detection. Then, to catalyze the development of SOD, we construct two large-scale Small Object Detection dAtasets (SODA), SODA-D and SODA-A, which focus on the Driving and Aerial scenarios respectively. SODA-D includes 24828 high-quality traffic images and 278433 instances of nine categories. For SODA-A, we harvest 2513 high resolution aerial images and annotate 872069 instances over nine classes. The proposed datasets, as we know, are the first-ever attempt to large-scale benchmarks with a vast collection of exhaustively annotated instances tailored for multi-category SOD. Finally, we evaluate the performance of mainstream methods on SODA. We expect the released benchmarks could facilitate the development of SOD and spawn more breakthroughs in this field. Datasets and codes are available at: \url{https://shaunyuan22.github.io/SODA}.

CVMar 19, 2022
Learning Self-Supervised Low-Rank Network for Single-Stage Weakly and Semi-Supervised Semantic Segmentation

Junwen Pan, Pengfei Zhu, Kaihua Zhang et al.

Semantic segmentation with limited annotations, such as weakly supervised semantic segmentation (WSSS) and semi-supervised semantic segmentation (SSSS), is a challenging task that has attracted much attention recently. Most leading WSSS methods employ a sophisticated multi-stage training strategy to estimate pseudo-labels as precise as possible, but they suffer from high model complexity. In contrast, there exists another research line that trains a single network with image-level labels in one training cycle. However, such a single-stage strategy often performs poorly because of the compounding effect caused by inaccurate pseudo-label estimation. To address this issue, this paper presents a Self-supervised Low-Rank Network (SLRNet) for single-stage WSSS and SSSS. The SLRNet uses cross-view self-supervision, that is, it simultaneously predicts several complementary attentive LR representations from different views of an image to learn precise pseudo-labels. Specifically, we reformulate the LR representation learning as a collective matrix factorization problem and optimize it jointly with the network learning in an end-to-end manner. The resulting LR representation deprecates noisy information while capturing stable semantics across different views, making it robust to the input variations, thereby reducing overfitting to self-supervision errors. The SLRNet can provide a unified single-stage framework for various label-efficient semantic segmentation settings: 1) WSSS with image-level labeled data, 2) SSSS with a few pixel-level labeled data, and 3) SSSS with a few pixel-level labeled data and many image-level labeled data. Extensive experiments on the Pascal VOC 2012, COCO, and L2ID datasets demonstrate that our SLRNet outperforms both state-of-the-art WSSS and SSSS methods with a variety of different settings, proving its good generalizability and efficacy.

CVSep 11, 2024Code
Retinex-RAWMamba: Bridging Demosaicing and Denoising for Low-Light RAW Image Enhancement

Xianmin Chen, Longfei Han, Peiliang Huang et al.

Low-light image enhancement, particularly in cross-domain tasks such as mapping from the raw domain to the sRGB domain, remains a significant challenge. Many deep learning-based methods have been developed to address this issue and have shown promising results in recent years. However, single-stage methods, which attempt to unify the complex mapping across both domains, leading to limited denoising performance. In contrast, existing two-stage approaches typically overlook the characteristic of demosaicing within the Image Signal Processing (ISP) pipeline, leading to color distortions under varying lighting conditions, especially in low-light scenarios. To address these issues, we propose a novel Mamba-based method customized for low light RAW images, called RAWMamba, to effectively handle raw images with different CFAs. Furthermore, we introduce a Retinex Decomposition Module (RDM) grounded in Retinex prior, which decouples illumination from reflectance to facilitate more effective denoising and automatic non-linear exposure correction, reducing the effect of manual linear illumination enhancement. By bridging demosaicing and denoising, better enhancement for low light RAW images is achieved. Experimental evaluations conducted on public datasets SID and MCR demonstrate that our proposed RAWMamba achieves state-of-the-art performance on cross-domain mapping. The code is available at https://github.com/Cynicarlos/RetinexRawMamba.

CVAug 18, 2023
Small Object Detection via Coarse-to-fine Proposal Generation and Imitation Learning

Xiang Yuan, Gong Cheng, Kebing Yan et al.

The past few years have witnessed the immense success of object detection, while current excellent detectors struggle on tackling size-limited instances. Concretely, the well-known challenge of low overlaps between the priors and object regions leads to a constrained sample pool for optimization, and the paucity of discriminative information further aggravates the recognition. To alleviate the aforementioned issues, we propose CFINet, a two-stage framework tailored for small object detection based on the Coarse-to-fine pipeline and Feature Imitation learning. Firstly, we introduce Coarse-to-fine RPN (CRPN) to ensure sufficient and high-quality proposals for small objects through the dynamic anchor selection strategy and cascade regression. Then, we equip the conventional detection head with a Feature Imitation (FI) branch to facilitate the region representations of size-limited instances that perplex the model in an imitation manner. Moreover, an auxiliary imitation loss following supervised contrastive learning paradigm is devised to optimize this branch. When integrated with Faster RCNN, CFINet achieves state-of-the-art performance on the large-scale small object detection benchmarks, SODA-D and SODA-A, underscoring its superiority over baseline detector and other mainstream detection approaches.

CLApr 21, 2023
ChatABL: Abductive Learning via Natural Language Interaction with ChatGPT

Tianyang Zhong, Yaonai Wei, Li Yang et al.

Large language models (LLMs) such as ChatGPT have recently demonstrated significant potential in mathematical abilities, providing valuable reasoning paradigm consistent with human natural language. However, LLMs currently have difficulty in bridging perception, language understanding and reasoning capabilities due to incompatibility of the underlying information flow among them, making it challenging to accomplish tasks autonomously. On the other hand, abductive learning (ABL) frameworks for integrating the two abilities of perception and reasoning has seen significant success in inverse decipherment of incomplete facts, but it is limited by the lack of semantic understanding of logical reasoning rules and the dependence on complicated domain knowledge representation. This paper presents a novel method (ChatABL) for integrating LLMs into the ABL framework, aiming at unifying the three abilities in a more user-friendly and understandable manner. The proposed method uses the strengths of LLMs' understanding and logical reasoning to correct the incomplete logical facts for optimizing the performance of perceptual module, by summarizing and reorganizing reasoning rules represented in natural language format. Similarly, perceptual module provides necessary reasoning examples for LLMs in natural language format. The variable-length handwritten equation deciphering task, an abstract expression of the Mayan calendar decoding, is used as a testbed to demonstrate that ChatABL has reasoning ability beyond most existing state-of-the-art methods, which has been well supported by comparative studies. To our best knowledge, the proposed ChatABL is the first attempt to explore a new pattern for further approaching human-level cognitive ability via natural language interaction with ChatGPT.

CVFeb 2, 2023
Boosting Low-Data Instance Segmentation by Unsupervised Pre-training with Saliency Prompt

Hao Li, Dingwen Zhang, Nian Liu et al.

Recently, inspired by DETR variants, query-based end-to-end instance segmentation (QEIS) methods have outperformed CNN-based models on large-scale datasets. Yet they would lose efficacy when only a small amount of training data is available since it's hard for the crucial queries/kernels to learn localization and shape priors. To this end, this work offers a novel unsupervised pre-training solution for low-data regimes. Inspired by the recent success of the Prompting technique, we introduce a new pre-training method that boosts QEIS models by giving Saliency Prompt for queries/kernels. Our method contains three parts: 1) Saliency Masks Proposal is responsible for generating pseudo masks from unlabeled images based on the saliency mechanism. 2) Prompt-Kernel Matching transfers pseudo masks into prompts and injects the corresponding localization and shape priors to the best-matched kernels. 3) Kernel Supervision is applied to supply supervision at the kernel level for robust learning. From a practical perspective, our pre-training method helps QEIS models achieve a similar convergence speed and comparable performance with CNN-based models in low-data regimes. Experimental results show that our method significantly boosts several QEIS models on three datasets. Code will be made available.

CVSep 28, 2024Code
3D-CT-GPT: Generating 3D Radiology Reports through Integration of Large Vision-Language Models

Hao Chen, Wei Zhao, Yingli Li et al.

Medical image analysis is crucial in modern radiological diagnostics, especially given the exponential growth in medical imaging data. The demand for automated report generation systems has become increasingly urgent. While prior research has mainly focused on using machine learning and multimodal language models for 2D medical images, the generation of reports for 3D medical images has been less explored due to data scarcity and computational complexities. This paper introduces 3D-CT-GPT, a Visual Question Answering (VQA)-based medical visual language model specifically designed for generating radiology reports from 3D CT scans, particularly chest CTs. Extensive experiments on both public and private datasets demonstrate that 3D-CT-GPT significantly outperforms existing methods in terms of report accuracy and quality. Although current methods are few, including the partially open-source CT2Rep and the open-source M3D, we ensured fair comparison through appropriate data conversion and evaluation methodologies. Experimental results indicate that 3D-CT-GPT enhances diagnostic accuracy and report coherence, establishing itself as a robust solution for clinical radiology report generation. Future work will focus on expanding the dataset and further optimizing the model to enhance its performance and applicability.

CVNov 20, 2023
GP-NeRF: Generalized Perception NeRF for Context-Aware 3D Scene Understanding

Hao Li, Dingwen Zhang, Yalun Dai et al.

Applying NeRF to downstream perception tasks for scene understanding and representation is becoming increasingly popular. Most existing methods treat semantic prediction as an additional rendering task, \textit{i.e.}, the "label rendering" task, to build semantic NeRFs. However, by rendering semantic/instance labels per pixel without considering the contextual information of the rendered image, these methods usually suffer from unclear boundary segmentation and abnormal segmentation of pixels within an object. To solve this problem, we propose Generalized Perception NeRF (GP-NeRF), a novel pipeline that makes the widely used segmentation model and NeRF work compatibly under a unified framework, for facilitating context-aware 3D scene perception. To accomplish this goal, we introduce transformers to aggregate radiance as well as semantic embedding fields jointly for novel views and facilitate the joint volumetric rendering of both fields. In addition, we propose two self-distillation mechanisms, i.e., the Semantic Distill Loss and the Depth-Guided Semantic Distill Loss, to enhance the discrimination and quality of the semantic field and the maintenance of geometric consistency. In evaluation, we conduct experimental comparisons under two perception tasks (\textit{i.e.} semantic and instance segmentation) using both synthetic and real-world datasets. Notably, our method outperforms SOTA approaches by 6.94\%, 11.76\%, and 8.47\% on generalized semantic segmentation, finetuning semantic segmentation, and instance segmentation, respectively.

CVFeb 13, 2023
Threatening Patch Attacks on Object Detection in Optical Remote Sensing Images

Xuxiang Sun, Gong Cheng, Lei Pei et al.

Advanced Patch Attacks (PAs) on object detection in natural images have pointed out the great safety vulnerability in methods based on deep neural networks. However, little attention has been paid to this topic in Optical Remote Sensing Images (O-RSIs). To this end, we focus on this research, i.e., PAs on object detection in O-RSIs, and propose a more Threatening PA without the scarification of the visual quality, dubbed TPA. Specifically, to address the problem of inconsistency between local and global landscapes in existing patch selection schemes, we propose leveraging the First-Order Difference (FOD) of the objective function before and after masking to select the sub-patches to be attacked. Further, considering the problem of gradient inundation when applying existing coordinate-based loss to PAs directly, we design an IoU-based objective function specific for PAs, dubbed Bounding box Drifting Loss (BDL), which pushes the detected bounding boxes far from the initial ones until there are no intersections between them. Finally, on two widely used benchmarks, i.e., DIOR and DOTA, comprehensive evaluations of our TPA with four typical detectors (Faster R-CNN, FCOS, RetinaNet, and YOLO-v4) witness its remarkable effectiveness. To the best of our knowledge, this is the first attempt to study the PAs on object detection in O-RSIs, and we hope this work can get our readers interested in studying this topic.

CLSep 10, 2023
Chat2Brain: A Method for Mapping Open-Ended Semantic Queries to Brain Activation Maps

Yaonai Wei, Tuo Zhang, Han Zhang et al.

Over decades, neuroscience has accumulated a wealth of research results in the text modality that can be used to explore cognitive processes. Meta-analysis is a typical method that successfully establishes a link from text queries to brain activation maps using these research results, but it still relies on an ideal query environment. In practical applications, text queries used for meta-analyses may encounter issues such as semantic redundancy and ambiguity, resulting in an inaccurate mapping to brain images. On the other hand, large language models (LLMs) like ChatGPT have shown great potential in tasks such as context understanding and reasoning, displaying a high degree of consistency with human natural language. Hence, LLMs could improve the connection between text modality and neuroscience, resolving existing challenges of meta-analyses. In this study, we propose a method called Chat2Brain that combines LLMs to basic text-2-image model, known as Text2Brain, to map open-ended semantic queries to brain activation maps in data-scarce and complex query environments. By utilizing the understanding and reasoning capabilities of LLMs, the performance of the mapping model is optimized by transferring text queries to semantic queries. We demonstrate that Chat2Brain can synthesize anatomically plausible neural activation patterns for more complex tasks of text queries.

CLOct 8, 2023
ChatRadio-Valuer: A Chat Large Language Model for Generalizable Radiology Report Generation Based on Multi-institution and Multi-system Data

Tianyang Zhong, Wei Zhao, Yutong Zhang et al.

Radiology report generation, as a key step in medical image analysis, is critical to the quantitative analysis of clinically informed decision-making levels. However, complex and diverse radiology reports with cross-source heterogeneity pose a huge generalizability challenge to the current methods under massive data volume, mainly because the style and normativity of radiology reports are obviously distinctive among institutions, body regions inspected and radiologists. Recently, the advent of large language models (LLM) offers great potential for recognizing signs of health conditions. To resolve the above problem, we collaborate with the Second Xiangya Hospital in China and propose ChatRadio-Valuer based on the LLM, a tailored model for automatic radiology report generation that learns generalizable representations and provides a basis pattern for model adaptation in sophisticated analysts' cases. Specifically, ChatRadio-Valuer is trained based on the radiology reports from a single institution by means of supervised fine-tuning, and then adapted to disease diagnosis tasks for human multi-system evaluation (i.e., chest, abdomen, muscle-skeleton, head, and maxillofacial $\&$ neck) from six different institutions in clinical-level events. The clinical dataset utilized in this study encompasses a remarkable total of \textbf{332,673} observations. From the comprehensive results on engineering indicators, clinical efficacy and deployment cost metrics, it can be shown that ChatRadio-Valuer consistently outperforms state-of-the-art models, especially ChatGPT (GPT-3.5-Turbo) and GPT-4 et al., in terms of the diseases diagnosis from radiology reports. ChatRadio-Valuer provides an effective avenue to boost model generalization performance and alleviate the annotation workload of experts to enable the promotion of clinical AI applications in radiology reports.

CVSep 27, 2023
Physics Inspired Hybrid Attention for SAR Target Recognition

Zhongling Huang, Chong Wu, Xiwen Yao et al.

There has been a recent emphasis on integrating physical models and deep neural networks (DNNs) for SAR target recognition, to improve performance and achieve a higher level of physical interpretability. The attributed scattering center (ASC) parameters garnered the most interest, being considered as additional input data or features for fusion in most methods. However, the performance greatly depends on the ASC optimization result, and the fusion strategy is not adaptable to different types of physical information. Meanwhile, the current evaluation scheme is inadequate to assess the model's robustness and generalizability. Thus, we propose a physics inspired hybrid attention (PIHA) mechanism and the once-for-all (OFA) evaluation protocol to address the above issues. PIHA leverages the high-level semantics of physical information to activate and guide the feature group aware of local semantics of target, so as to re-weight the feature importance based on knowledge prior. It is flexible and generally applicable to various physical models, and can be integrated into arbitrary DNNs without modifying the original architecture. The experiments involve a rigorous assessment using the proposed OFA, which entails training and validating a model on either sufficient or limited data and evaluating on multiple test sets with different data distributions. Our method outperforms other state-of-the-art approaches in 12 test scenarios with same ASC parameters. Moreover, we analyze the working mechanism of PIHA and evaluate various PIHA enabled DNNs. The experiments also show PIHA is effective for different physical information. The source code together with the adopted physical information is available at https://github.com/XAI4SAR.

CVOct 18, 2023
VST++: Efficient and Stronger Visual Saliency Transformer

Nian Liu, Ziyang Luo, Ni Zhang et al.

While previous CNN-based models have exhibited promising results for salient object detection (SOD), their ability to explore global long-range dependencies is restricted. Our previous work, the Visual Saliency Transformer (VST), addressed this constraint from a transformer-based sequence-to-sequence perspective, to unify RGB and RGB-D SOD. In VST, we developed a multi-task transformer decoder that concurrently predicts saliency and boundary outcomes in a pure transformer architecture. Moreover, we introduced a novel token upsampling method called reverse T2T for predicting a high-resolution saliency map effortlessly within transformer-based structures. Building upon the VST model, we further propose an efficient and stronger VST version in this work, i.e. VST++. To mitigate the computational costs of the VST model, we propose a Select-Integrate Attention (SIA) module, partitioning foreground into fine-grained segments and aggregating background information into a single coarse-grained token. To incorporate 3D depth information with low cost, we design a novel depth position encoding method tailored for depth maps. Furthermore, we introduce a token-supervised prediction loss to provide straightforward guidance for the task-related tokens. We evaluate our VST++ model across various transformer-based backbones on RGB, RGB-D, and RGB-T SOD benchmark datasets. Experimental results show that our model outperforms existing methods while achieving a 25% reduction in computational costs without significant performance compromise. The demonstrated strong ability for generalization, enhanced performance, and heightened efficiency of our VST++ model highlight its potential.

CVFeb 3, 2023
Revisiting Long-tailed Image Classification: Survey and Benchmarks with New Evaluation Metrics

Chaowei Fang, Dingwen Zhang, Wen Zheng et al.

Recently, long-tailed image classification harvests lots of research attention, since the data distribution is long-tailed in many real-world situations. Piles of algorithms are devised to address the data imbalance problem by biasing the training process towards less frequent classes. However, they usually evaluate the performance on a balanced testing set or multiple independent testing sets having distinct distributions with the training data. Considering the testing data may have arbitrary distributions, existing evaluation strategies are unable to reflect the actual classification performance objectively. We set up novel evaluation benchmarks based on a series of testing sets with evolving distributions. A corpus of metrics are designed for measuring the accuracy, robustness, and bounds of algorithms for learning with long-tailed distribution. Based on our benchmarks, we re-evaluate the performance of existing methods on CIFAR10 and CIFAR100 datasets, which is valuable for guiding the selection of data rebalancing techniques. We also revisit existing methods and categorize them into four types including data balancing, feature balancing, loss balancing, and prediction balancing, according the focused procedure during the training pipeline.

CVSep 2, 2024
CONDA: Condensed Deep Association Learning for Co-Salient Object Detection

Long Li, Nian Liu, Dingwen Zhang et al.

Inter-image association modeling is crucial for co-salient object detection. Despite satisfactory performance, previous methods still have limitations on sufficient inter-image association modeling. Because most of them focus on image feature optimization under the guidance of heuristically calculated raw inter-image associations. They directly rely on raw associations which are not reliable in complex scenarios, and their image feature optimization approach is not explicit for inter-image association modeling. To alleviate these limitations, this paper proposes a deep association learning strategy that deploys deep networks on raw associations to explicitly transform them into deep association features. Specifically, we first create hyperassociations to collect dense pixel-pair-wise raw associations and then deploys deep aggregation networks on them. We design a progressive association generation module for this purpose with additional enhancement of the hyperassociation calculation. More importantly, we propose a correspondence-induced association condensation module that introduces a pretext task, i.e. semantic correspondence estimation, to condense the hyperassociations for computational burden reduction and noise elimination. We also design an object-aware cycle consistency loss for high-quality correspondence estimations. Experimental results in three benchmark datasets demonstrate the remarkable effectiveness of our proposed method with various training settings.

NCMay 21, 2022
Brain Cortical Functional Gradients Predict Cortical Folding Patterns via Attention Mesh Convolution

Li Yang, Zhibin He, Changhe Li et al.

Since gyri and sulci, two basic anatomical building blocks of cortical folding patterns, were suggested to bear different functional roles, a precise mapping from brain function to gyro-sulcal patterns can provide profound insights into both biological and artificial neural networks. However, there lacks a generic theory and effective computational model so far, due to the highly nonlinear relation between them, huge inter-individual variabilities and a sophisticated description of brain function regions/networks distribution as mosaics, such that spatial patterning of them has not been considered. we adopted brain functional gradients derived from resting-state fMRI to embed the "gradual" change of functional connectivity patterns, and developed a novel attention mesh convolution model to predict cortical gyro-sulcal segmentation maps on individual brains. The convolution on mesh considers the spatial organization of functional gradients and folding patterns on a cortical sheet and the newly designed channel attention block enhances the interpretability of the contribution of different functional gradients to cortical folding prediction. Experiments show that the prediction performance via our model outperforms other state-of-the-art models. In addition, we found that the dominant functional gradients contribute less to folding prediction. On the activation maps of the last layer, some well-studied cortical landmarks are found on the borders of, rather than within, the highly activated regions. These results and findings suggest that a specifically designed artificial neural network can improve the precision of the mapping between brain functions and cortical folding patterns, and can provide valuable insight of brain anatomy-function relation for neuroscience.

CLJan 30Code
FNF: Functional Network Fingerprint for Large Language Models

Yiheng Liu, Junhao Ning, Sichen Xia et al.

The development of large language models (LLMs) is costly and has significant commercial value. Consequently, preventing unauthorized appropriation of open-source LLMs and protecting developers' intellectual property rights have become critical challenges. In this work, we propose the Functional Network Fingerprint (FNF), a training-free, sample-efficient method for detecting whether a suspect LLM is derived from a victim model, based on the consistency between their functional network activity. We demonstrate that models that share a common origin, even with differences in scale or architecture, exhibit highly consistent patterns of neuronal activity within their functional networks across diverse input samples. In contrast, models trained independently on distinct data or with different objectives fail to preserve such activity alignment. Unlike conventional approaches, our method requires only a few samples for verification, preserves model utility, and remains robust to common model modifications (such as fine-tuning, pruning, and parameter permutation), as well as to comparisons across diverse architectures and dimensionalities. FNF thus provides model owners and third parties with a simple, non-invasive, and effective tool for protecting LLM intellectual property. The code is available at https://github.com/WhatAboutMyStar/LLM_ACTIVATION.

47.7CVMar 31
Square Superpixel Generation and Representation Learning via Granular Ball Computing

Shuyin Xia, Meng Yang, Dawei Dai et al.

Superpixels provide a compact region-based representation that preserves object boundaries and local structures, and have therefore been widely used in a variety of vision tasks to reduce computational cost. However, most existing superpixel algorithms produce irregularly shaped regions, which are not well aligned with regular operators such as convolutions. Consequently, superpixels are often treated as an offline preprocessing step, limiting parallel implementation and hindering end-to-end optimization within deep learning pipelines. Motivated by the adaptive representation and coverage property of granular-ball computing, we develop a square superpixel generation approach. Specifically, we approximate superpixels using multi-scale square blocks to avoid the computational and implementation difficulties induced by irregular shapes, enabling efficient parallel processing and learnable feature extraction. For each block, a purity score is computed based on pixel-intensity similarity, and high-quality blocks are selected accordingly. The resulting square superpixels can be readily integrated as graph nodes in graph neural networks (GNNs) or as tokens in Vision Transformers (ViTs), facilitating multi-scale information aggregation and structured visual representation. Experimental results on downstream tasks demonstrate consistent performance improvements, validating the effectiveness of the proposed method.

CVNov 20, 2022
Progressively Dual Prior Guided Few-shot Semantic Segmentation

Qinglong Cao, Yuntian Chen, Xiwen Yao et al.

Few-shot semantic segmentation task aims at performing segmentation in query images with a few annotated support samples. Currently, few-shot segmentation methods mainly focus on leveraging foreground information without fully utilizing the rich background information, which could result in wrong activation of foreground-like background regions with the inadaptability to dramatic scene changes of support-query image pairs. Meanwhile, the lack of detail mining mechanism could cause coarse parsing results without some semantic components or edge areas since prototypes have limited ability to cope with large object appearance variance. To tackle these problems, we propose a progressively dual prior guided few-shot semantic segmentation network. Specifically, a dual prior mask generation (DPMG) module is firstly designed to suppress the wrong activation in foreground-background comparison manner by regarding background as assisted refinement information. With dual prior masks refining the location of foreground area, we further propose a progressive semantic detail enrichment (PSDE) module which forces the parsing model to capture the hidden semantic details by iteratively erasing the high-confidence foreground region and activating details in the rest region with a hierarchical structure. The collaboration of DPMG and PSDE formulates a novel few-shot segmentation network that can be learned in an end-to-end manner. Comprehensive experiments on PASCAL-5i and MS COCO powerfully demonstrate that our proposed algorithm achieves the great performance.

SDSep 15, 2024
A Survey of Foundation Models for Music Understanding

Wenjun Li, Ying Cai, Ziyang Wu et al.

Music is essential in daily life, fulfilling emotional and entertainment needs, and connecting us personally, socially, and culturally. A better understanding of music can enhance our emotions, cognitive skills, and cultural connections. The rapid advancement of artificial intelligence (AI) has introduced new ways to analyze music, aiming to replicate human understanding of music and provide related services. While the traditional models focused on audio features and simple tasks, the recent development of large language models (LLMs) and foundation models (FMs), which excel in various fields by integrating semantic information and demonstrating strong reasoning abilities, could capture complex musical features and patterns, integrate music with language and incorporate rich musical, emotional and psychological knowledge. Therefore, they have the potential in handling complex music understanding tasks from a semantic perspective, producing outputs closer to human perception. This work, to our best knowledge, is one of the early reviews of the intersection of AI techniques and music understanding. We investigated, analyzed, and tested recent large-scale music foundation models in respect of their music comprehension abilities. We also discussed their limitations and proposed possible future directions, offering insights for researchers in this field.

CVMar 12, 2024Code
Continual All-in-One Adverse Weather Removal with Knowledge Replay on a Unified Network Structure

De Cheng, Yanling Ji, Dong Gong et al.

In real-world applications, image degeneration caused by adverse weather is always complex and changes with different weather conditions from days and seasons. Systems in real-world environments constantly encounter adverse weather conditions that are not previously observed. Therefore, it practically requires adverse weather removal models to continually learn from incrementally collected data reflecting various degeneration types. Existing adverse weather removal approaches, for either single or multiple adverse weathers, are mainly designed for a static learning paradigm, which assumes that the data of all types of degenerations to handle can be finely collected at one time before a single-phase learning process. They thus cannot directly handle the incremental learning requirements. To address this issue, we made the earliest effort to investigate the continual all-in-one adverse weather removal task, in a setting closer to real-world applications. Specifically, we develop a novel continual learning framework with effective knowledge replay (KR) on a unified network structure. Equipped with a principal component projection and an effective knowledge distillation mechanism, the proposed KR techniques are tailored for the all-in-one weather removal task. It considers the characteristics of the image restoration task with multiple degenerations in continual learning, and the knowledge for different degenerations can be shared and accumulated in the unified network structure. Extensive experimental results demonstrate the effectiveness of the proposed method to deal with this challenging task, which performs competitively to existing dedicated or joint training image restoration methods. Our code is available at https://github.com/xiaojihh/CL_all-in-one.

CVNov 1, 2025
CoT-Saliency: Unified Chain-of-Thought Reasoning for Heterogeneous Saliency Tasks

Long Li, Shuichen Ji, Ziyang Luo et al.

We present the first unified framework that jointly handles three operationally heterogeneous saliency tasks, eg, SOD, CoSOD, and SIS, by casting each as a Chain-of-Thought (CoT) reasoning process in a Vision-Language Model (VLM) to bridge task heterogeneity. CoT training follows a two-stage paradigm: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). To enhance CoT quality in RL, we propose Confidence-Guided Policy Optimization (CGPO), a lightweight single-sample algorithm that leverages the discrepancy between reward and model confidence as a per-sample advantage signal. This design naturally focuses updates on informative responses while eliminating group sampling, thereby addressing GRPO's key limitations: confidence-agnostic learning, signal dilution, and prohibitive computational overhead. We also introduce an "output-to-reasoning" strategy to construct high-fidelity SFT data that ensures logical consistency with ground-truth masks. Experiments show our model matches or outperforms specialized SOTA methods and strong closed-source VLMs across all tasks, especially achieving an S-measure of 0.899 on CoCA for CoSOD, surpassing the prior best by 8.0 percentage points, despite using far less training data.

CVNov 5, 2024Code
Generative Artificial Intelligence Meets Synthetic Aperture Radar: A Survey

Zhongling Huang, Xidan Zhang, Zuqian Tang et al.

SAR images possess unique attributes that present challenges for both human observers and vision AI models to interpret, owing to their electromagnetic characteristics. The interpretation of SAR images encounters various hurdles, with one of the primary obstacles being the data itself, which includes issues related to both the quantity and quality of the data. The challenges can be addressed using generative AI technologies. Generative AI, often known as GenAI, is a very advanced and powerful technology in the field of artificial intelligence that has gained significant attention. The advancement has created possibilities for the creation of texts, photorealistic pictures, videos, and material in various modalities. This paper aims to comprehensively investigate the intersection of GenAI and SAR. First, we illustrate the common data generation-based applications in SAR field and compare them with computer vision tasks, analyzing the similarity, difference, and general challenges of them. Then, an overview of the latest GenAI models is systematically reviewed, including various basic models and their variations targeting the general challenges. Additionally, the corresponding applications in SAR domain are also included. Specifically, we propose to summarize the physical model based simulation approaches for SAR, and analyze the hybrid modeling methods that combine the GenAI and interpretable models. The evaluation methods that have been or could be applied to SAR, are also explored. Finally, the potential challenges and future prospects are discussed. To our best knowledge, this survey is the first exhaustive examination of the interdiscipline of SAR and GenAI, encompassing a wide range of topics, including deep neural networks, physical models, computer vision, and SAR images. The resources of this survey are open-source at \url{https://github.com/XAI4SAR/GenAIxSAR}.

CVDec 24, 2025
Granular Ball Guided Masking: Structure-aware Data Augmentation

Shuyin Xia, Fan Chen, Dawei Dai et al.

Deep learning models have achieved remarkable success in computer vision but still rely heavily on large-scale labeled data and tend to overfit when data is limited or distributions shift. Data augmentation -- particularly mask-based information dropping -- can enhance robustness by forcing models to explore complementary cues; however, existing approaches often lack structural awareness and risk discarding essential semantics. We propose Granular Ball Guided Masking (GBGM), a structure-aware augmentation strategy guided by Granular Ball Computing (GBC). GBGM adaptively preserves semantically rich, structurally important regions while suppressing redundant areas through a coarse-to-fine hierarchical masking process, producing augmentations that are both representative and discriminative. Extensive experiments on multiple benchmarks demonstrate consistent improvements not only in image classification and masked image reconstruction, but also in image tampering detection, validating the effectiveness and generalization of GBGM across both recognition and forensic scenarios. Simple and model-agnostic, GBGM integrates seamlessly into CNNs and Vision Transformers, offering a practical paradigm for structure-aware data augmentation.

29.2CVMar 18
Does YOLO Really Need to See Every Training Image in Every Epoch?

Xingxing Xie, Jiahua Dong, Junwei Han et al.

YOLO detectors are known for their fast inference speed, yet training them remains unexpectedly time-consuming due to their exhaustive pipeline that processes every training image in every epoch, even when many images have already been sufficiently learned. This stands in clear contrast to the efficiency suggested by the ``You Only Look Once'' philosophy. This naturally raises an important question: \textit{Does YOLO really need to see every training image in every epoch?} To explore this, we propose an Anti-Forgetting Sampling Strategy (AFSS) that dynamically determines which images should be used and which can be skipped during each epoch, allowing the detector to learn more effectively and efficiently. Specifically, AFSS measures the learning sufficiency of each training image as the minimum of its detection recall and precision, and dynamically categorizes training images into easy, medium, or hard levels accordingly. Easy training images are sparsely resampled during training in a continuous review manner, with priority given to those that have not been used for a long time to reduce redundancy and prevent forgetting. Moderate training images are partially selected, prioritizing recently unused ones and randomly choosing the rest from unselected images to ensure coverage and prevent forgetting. Hard training images are fully sampled in every epoch to ensure sufficient learning. The learning sufficiency of each training image is periodically updated, enabling detectors to adaptively shift its focus toward the informative training images over time while progressively discarding redundant ones. On widely used natural image detection benchmarks (MS COCO 2017 and PASCAL VOC 2007) and remote sensing detection datasets (DOTA-v1.0 and DIOR-R), AFSS achieves more than $1.43\times$ training speedup for YOLO-series detectors while also improving accuracy.

CVMay 22, 2024Code
Unsupervised Pre-training with Language-Vision Prompts for Low-Data Instance Segmentation

Dingwen Zhang, Hao Li, Diqi He et al.

In recent times, following the paradigm of DETR (DEtection TRansformer), query-based end-to-end instance segmentation (QEIS) methods have exhibited superior performance compared to CNN-based models, particularly when trained on large-scale datasets. Nevertheless, the effectiveness of these QEIS methods diminishes significantly when confronted with limited training data. This limitation arises from their reliance on substantial data volumes to effectively train the pivotal queries/kernels that are essential for acquiring localization and shape priors. To address this problem, we propose a novel method for unsupervised pre-training in low-data regimes. Inspired by the recently successful prompting technique, we introduce a new method, Unsupervised Pre-training with Language-Vision Prompts (UPLVP), which improves QEIS models' instance segmentation by bringing language-vision prompts to queries/kernels. Our method consists of three parts: (1) Masks Proposal: Utilizes language-vision models to generate pseudo masks based on unlabeled images. (2) Prompt-Kernel Matching: Converts pseudo masks into prompts and injects the best-matched localization and shape features to their corresponding kernels. (3) Kernel Supervision: Formulates supervision for pre-training at the kernel level to ensure robust learning. With the help of our pre-training method, QEIS models can converge faster and perform better than CNN-based models in low-data regimes. Experimental evaluations conducted on MS COCO, Cityscapes, and CTW1500 datasets indicate that the QEIS models' performance can be significantly improved when pre-trained with our method. Code will be available at: https://github.com/lifuguan/UPLVP.

79.8CVApr 16
Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs

Ziyang Luo, Nian Liu, Junwei Han

Omni-modal Large Language Models (Omni-MLLMs) promise a unified integration of diverse sensory streams. However, recent evaluations reveal a critical performance paradox: unimodal baselines frequently outperform joint multimodal inference. We trace this perceptual fragility to the static fusion topologies universally employed by current models, identifying two structural pathologies: positional bias in sequential inputs and alignment traps in interleaved formats, which systematically distort attention regardless of task semantics. To resolve this functional rigidity, we propose Chain of Modality (CoM), an agentic framework that transitions multimodal fusion from passive concatenation to dynamic orchestration. CoM adaptively orchestrates input topologies, switching among parallel, sequential, and interleaved pathways to neutralize structural biases. Furthermore, CoM bifurcates cognitive execution into two task-aligned pathways: a streamlined ``Direct-Decide'' path for direct perception and a structured ``Reason-Decide'' path for analytical auditing. Operating in either a training-free or a data-efficient SFT setting, CoM achieves robust and consistent generalization across diverse benchmarks.

CVJul 28, 2024
X-Fake: Juggling Utility Evaluation and Explanation of Simulated SAR Images

Zhongling Huang, Yihan Zhuang, Zipei Zhong et al.

SAR image simulation has attracted much attention due to its great potential to supplement the scarce training data for deep learning algorithms. Consequently, evaluating the quality of the simulated SAR image is crucial for practical applications. The current literature primarily uses image quality assessment techniques for evaluation that rely on human observers' perceptions. However, because of the unique imaging mechanism of SAR, these techniques may produce evaluation results that are not entirely valid. The distribution inconsistency between real and simulated data is the main obstacle that influences the utility of simulated SAR images. To this end, we propose a novel trustworthy utility evaluation framework with a counterfactual explanation for simulated SAR images for the first time, denoted as X-Fake. It unifies a probabilistic evaluator and a causal explainer to achieve a trustworthy utility assessment. We construct the evaluator using a probabilistic Bayesian deep model to learn the posterior distribution, conditioned on real data. Quantitatively, the predicted uncertainty of simulated data can reflect the distribution discrepancy. We build the causal explainer with an introspective variational auto-encoder to generate high-resolution counterfactuals. The latent code of IntroVAE is finally optimized with evaluation indicators and prior information to generate the counterfactual explanation, thus revealing the inauthentic details of simulated data explicitly. The proposed framework is validated on four simulated SAR image datasets obtained from electromagnetic models and generative artificial intelligence approaches. The results demonstrate the proposed X-Fake framework outperforms other IQA methods in terms of utility. Furthermore, the results illustrate that the generated counterfactual explanations are trustworthy, and can further improve the data utility in applications.

NCFeb 13, 2025Code
Brain-Inspired Exploration of Functional Networks and Key Neurons in Large Language Models

Yiheng Liu, Xiaohui Gao, Haiyang Sun et al.

In recent years, the rapid advancement of large language models (LLMs) in natural language processing has sparked significant interest among researchers to understand their mechanisms and functional characteristics. Although existing studies have attempted to explain LLM functionalities by identifying and interpreting specific neurons, these efforts mostly focus on individual neuron contributions, neglecting the fact that human brain functions are realized through intricate interaction networks. Inspired by cognitive neuroscience research on functional brain networks (FBNs), this study introduces a novel approach to investigate whether similar functional networks exist within LLMs. We use methods similar to those in the field of functional neuroimaging analysis to locate and identify functional networks in LLM. Experimental results show that, similar to the human brain, LLMs contain functional networks that frequently recur during operation. Further analysis shows that these functional networks are crucial for LLM performance. Masking key functional networks significantly impairs the model's performance, while retaining just a subset of these networks is adequate to maintain effective operation. This research provides novel insights into the interpretation of LLMs and the lightweighting of LLMs for certain downstream tasks. Code is available at https://github.com/WhatAboutMyStar/LLM_ACTIVATION.

CVNov 19, 2024Code
Physics-Guided Detector for SAR Airplanes

Zhongling Huang, Long Liu, Shuxin Yang et al.

The disperse structure distributions (discreteness) and variant scattering characteristics (variability) of SAR airplane targets lead to special challenges of object detection and recognition. The current deep learning-based detectors encounter challenges in distinguishing fine-grained SAR airplanes against complex backgrounds. To address it, we propose a novel physics-guided detector (PGD) learning paradigm for SAR airplanes that comprehensively investigate their discreteness and variability to improve the detection performance. It is a general learning paradigm that can be extended to different existing deep learning-based detectors with "backbone-neck-head" architectures. The main contributions of PGD include the physics-guided self-supervised learning, feature enhancement, and instance perception, denoted as PGSSL, PGFE, and PGIP, respectively. PGSSL aims to construct a self-supervised learning task based on a wide range of SAR airplane targets that encodes the prior knowledge of various discrete structure distributions into the embedded space. Then, PGFE enhances the multi-scale feature representation of a detector, guided by the physics-aware information learned from PGSSL. PGIP is constructed at the detection head to learn the refined and dominant scattering point of each SAR airplane instance, thus alleviating the interference from the complex background. We propose two implementations, denoted as PGD and PGD-Lite, and apply them to various existing detectors with different backbones and detection heads. The experiments demonstrate the flexibility and effectiveness of the proposed PGD, which can improve existing detectors on SAR airplane detection with fine-grained classification task (an improvement of 3.1\% mAP most), and achieve the state-of-the-art performance (90.7\% mAP) on SAR-AIRcraft-1.0 dataset. The project is open-source at \url{https://github.com/XAI4SAR/PGD}.

CVOct 17, 2025Code
Hierarchical Mixing Architecture for Low-light RAW Image Enhancement

Xianmin Chen, Peiliang Huang, Longfei Han et al.

With the rapid development of deep learning, low-light RAW image enhancement (LLRIE) has achieved remarkable progress. However, the challenge that how to simultaneously achieve strong enhancement quality and high efficiency still remains. Leveraging the inherent efficiency of Channel Attention and Mamba, we introduce a Hierarchical Mixing Architecture (HiMA), a hybrid LLRIE framework built upon two core modules. Specifically, we introduce Large Scale Block (LSB) for upper layers and Small Scale Block (SSB) for lower layers that reduce the parameters while improve the performance. Based on this framework, we also introduce a novel Local Distribution Adjustment (LoDA) module that adaptively aligns local feature statistics in a content-aware manner by learning to adjust regional luminance and contrast distributions. Moreover, to alleviate the domain ambiguity commonly observed in existing LLRIE pipelines, we design a Multi-Prior Fusion (MPF) module that leverages three complementary priors extracted from the first stage of the hybrid architecture to maintain domain consistency. Extensive experiments on multiple public benchmarks demonstrate that our approach outperforms state-of-the-art methods, delivering superior performance with fewer parameters. Code is available at https://github.com/Cynicarlos/HiMA.

CVAug 15, 2025Code
Multi-State Tracker: Enhancing Efficient Object Tracking via Multi-State Specialization and Interaction

Shilei Wang, Gong Cheng, Pujian Lai et al.

Efficient trackers achieve faster runtime by reducing computational complexity and model parameters. However, this efficiency often compromises the expense of weakened feature representation capacity, thus limiting their ability to accurately capture target states using single-layer features. To overcome this limitation, we propose Multi-State Tracker (MST), which utilizes highly lightweight state-specific enhancement (SSE) to perform specialized enhancement on multi-state features produced by multi-state generation (MSG) and aggregates them in an interactive and adaptive manner using cross-state interaction (CSI). This design greatly enhances feature representation while incurring minimal computational overhead, leading to improved tracking robustness in complex environments. Specifically, the MSG generates multiple state representations at multiple stages during feature extraction, while SSE refines them to highlight target-specific features. The CSI module facilitates information exchange between these states and ensures the integration of complementary features. Notably, the introduced SSE and CSI modules adopt a highly lightweight hidden state adaptation-based state space duality (HSA-SSD) design, incurring only 0.1 GFLOPs in computation and 0.66 M in parameters. Experimental results demonstrate that MST outperforms all previous efficient trackers across multiple datasets, significantly improving tracking accuracy and robustness. In particular, it shows excellent runtime performance, with an AO score improvement of 4.5\% over the previous SOTA efficient tracker HCAT on the GOT-10K dataset. The code is available at https://github.com/wsumel/MST.

CLAug 7, 2025Code
Pruning Large Language Models by Identifying and Preserving Functional Networks

Yiheng Liu, Junhao Ning, Sichen Xia et al.

Structured pruning is one of the representative techniques for compressing large language models (LLMs) to reduce GPU memory consumption and accelerate inference speed. It offers significant practical value in improving the efficiency of LLMs in real-world applications. Current structured pruning methods typically rely on assessment of the importance of the structure units and pruning the units with less importance. Most of them overlooks the interaction and collaboration among artificial neurons that are crucial for the functionalities of LLMs, leading to a disruption in the macro functional architecture of LLMs and consequently a pruning performance degradation. Inspired by the inherent similarities between artificial neural networks and functional neural networks in the human brain, we alleviate this challenge and propose to prune LLMs by identifying and preserving functional networks within LLMs in this study. To achieve this, we treat an LLM as a digital brain and decompose the LLM into functional networks, analogous to identifying functional brain networks in neuroimaging data. Afterwards, an LLM is pruned by preserving the key neurons within these functional networks. Experimental results demonstrate that the proposed method can successfully identify and locate functional networks and key neurons in LLMs, enabling efficient model pruning. Our code is available at https://github.com/WhatAboutMyStar/LLM_ACTIVATION.

IVJun 30, 2024Code
ASPS: Augmented Segment Anything Model for Polyp Segmentation

Huiqian Li, Dingwen Zhang, Jieru Yao et al.

Polyp segmentation plays a pivotal role in colorectal cancer diagnosis. Recently, the emergence of the Segment Anything Model (SAM) has introduced unprecedented potential for polyp segmentation, leveraging its powerful pre-training capability on large-scale datasets. However, due to the domain gap between natural and endoscopy images, SAM encounters two limitations in achieving effective performance in polyp segmentation. Firstly, its Transformer-based structure prioritizes global and low-frequency information, potentially overlooking local details, and introducing bias into the learned features. Secondly, when applied to endoscopy images, its poor out-of-distribution (OOD) performance results in substandard predictions and biased confidence output. To tackle these challenges, we introduce a novel approach named Augmented SAM for Polyp Segmentation (ASPS), equipped with two modules: Cross-branch Feature Augmentation (CFA) and Uncertainty-guided Prediction Regularization (UPR). CFA integrates a trainable CNN encoder branch with a frozen ViT encoder, enabling the integration of domain-specific knowledge while enhancing local features and high-frequency details. Moreover, UPR ingeniously leverages SAM's IoU score to mitigate uncertainty during the training procedure, thereby improving OOD performance and domain generalization. Extensive experimental results demonstrate the effectiveness and utility of the proposed method in improving SAM's performance in polyp segmentation. Our code is available at https://github.com/HuiqianLi/ASPS.

CVDec 17, 2024Code
PolSAM: Polarimetric Scattering Mechanism Informed Segment Anything Model

Yuqing Wang, Zhongling Huang, Shuxin Yang et al.

PolSAR data presents unique challenges due to its rich and complex characteristics. Existing data representations, such as complex-valued data, polarimetric features, and amplitude images, are widely used. However, these formats often face issues related to usability, interpretability, and data integrity. Most feature extraction networks for PolSAR are small, limiting their ability to capture features effectively. To address these issues, We propose the Polarimetric Scattering Mechanism-Informed SAM (PolSAM), an enhanced Segment Anything Model (SAM) that integrates domain-specific scattering characteristics and a novel prompt generation strategy. PolSAM introduces Microwave Vision Data (MVD), a lightweight and interpretable data representation derived from polarimetric decomposition and semantic correlations. We propose two key components: the Feature-Level Fusion Prompt (FFP), which fuses visual tokens from pseudo-colored SAR images and MVD to address modality incompatibility in the frozen SAM encoder, and the Semantic-Level Fusion Prompt (SFP), which refines sparse and dense segmentation prompts using semantic information. Experimental results on the PhySAR-Seg datasets demonstrate that PolSAM significantly outperforms existing SAM-based and multimodal fusion models, improving segmentation accuracy, reducing data storage, and accelerating inference time. The source code and datasets will be made publicly available at https://github.com/XAI4SAR/PolSAM.

CVDec 17, 2021Code
Pixel Distillation: A New Knowledge Distillation Scheme for Low-Resolution Image Recognition

Guangyu Guo, Dingwen Zhang, Longfei Han et al.

Previous knowledge distillation (KD) methods mostly focus on compressing network architectures, which is not thorough enough in deployment as some costs like transmission bandwidth and imaging equipment are related to the image size. Therefore, we propose Pixel Distillation that extends knowledge distillation into the input level while simultaneously breaking architecture constraints. Such a scheme can achieve flexible cost control for deployment, as it allows the system to adjust both network architecture and image quality according to the overall requirement of resources. Specifically, we first propose an input spatial representation distillation (ISRD) mechanism to transfer spatial knowledge from large images to student's input module, which can facilitate stable knowledge transfer between CNN and ViT. Then, a Teacher-Assistant-Student (TAS) framework is further established to disentangle pixel distillation into the model compression stage and input compression stage, which significantly reduces the overall complexity of pixel distillation and the difficulty of distilling intermediate knowledge. Finally, we adapt pixel distillation to object detection via an aligned feature for preservation (AFP) strategy for TAS, which aligns output dimensions of detectors at each stage by manipulating features and anchors of the assistant. Comprehensive experiments on image classification and object detection demonstrate the effectiveness of our method. Code is available at https://github.com/gyguo/PixelDistillation.

CVNov 24, 2021Code
Background-Click Supervision for Temporal Action Localization

Le Yang, Junwei Han, Tao Zhao et al.

Weakly supervised temporal action localization aims at learning the instance-level action pattern from the video-level labels, where a significant challenge is action-context confusion. To overcome this challenge, one recent work builds an action-click supervision framework. It requires similar annotation costs but can steadily improve the localization performance when compared to the conventional weakly supervised methods. In this paper, by revealing that the performance bottleneck of the existing approaches mainly comes from the background errors, we find that a stronger action localizer can be trained with labels on the background video frames rather than those on the action frames. To this end, we convert the action-click supervision to the background-click supervision and develop a novel method, called BackTAL. Specifically, BackTAL implements two-fold modeling on the background video frames, i.e. the position modeling and the feature modeling. In position modeling, we not only conduct supervised learning on the annotated video frames but also design a score separation module to enlarge the score differences between the potential action frames and backgrounds. In feature modeling, we propose an affinity module to measure frame-specific similarities among neighboring frames and dynamically attend to informative neighbors when calculating temporal convolution. Extensive experiments on three benchmarks are conducted, which demonstrate the high performance of the established BackTAL and the rationality of the proposed background-click supervision. Code is available at https://github.com/VividLe/BackTAL.

CVOct 5, 2021Code
Anchor-free Oriented Proposal Generator for Object Detection

Gong Cheng, Jiabao Wang, Ke Li et al.

Oriented object detection is a practical and challenging task in remote sensing image interpretation. Nowadays, oriented detectors mostly use horizontal boxes as intermedium to derive oriented boxes from them. However, the horizontal boxes are inclined to get small Intersection-over-Unions (IoUs) with ground truths, which may have some undesirable effects, such as introducing redundant noise, mismatching with ground truths, detracting from the robustness of detectors, etc. In this paper, we propose a novel Anchor-free Oriented Proposal Generator (AOPG) that abandons horizontal box-related operations from the network architecture. AOPG first produces coarse oriented boxes by a Coarse Location Module (CLM) in an anchor-free manner and then refines them into high-quality oriented proposals. After AOPG, we apply a Fast R-CNN head to produce the final detection results. Furthermore, the shortage of large-scale datasets is also a hindrance to the development of oriented object detection. To alleviate the data insufficiency, we release a new dataset on the basis of our DIOR dataset and name it DIOR-R. Massive experiments demonstrate the effectiveness of AOPG. Particularly, without bells and whistles, we achieve the accuracy of 64.41%, 75.24% and 96.22% mAP on the DIOR-R, DOTA and HRSC2016 datasets respectively. Code and models are available at https://github.com/jbwang1997/AOPG.

CVOct 1, 2021Code
Summarize and Search: Learning Consensus-aware Dynamic Convolution for Co-Saliency Detection

Ni Zhang, Junwei Han, Nian Liu et al.

Humans perform co-saliency detection by first summarizing the consensus knowledge in the whole group and then searching corresponding objects in each image. Previous methods usually lack robustness, scalability, or stability for the first process and simply fuse consensus features with image features for the second process. In this paper, we propose a novel consensus-aware dynamic convolution model to explicitly and effectively perform the "summarize and search" process. To summarize consensus image features, we first summarize robust features for every single image using an effective pooling method and then aggregate cross-image consensus cues via the self-attention mechanism. By doing this, our model meets the scalability and stability requirements. Next, we generate dynamic kernels from consensus features to encode the summarized consensus knowledge. Two kinds of kernels are generated in a supplementary way to summarize fine-grained image-specific consensus object cues and the coarse group-wise common knowledge, respectively. Then, we can effectively perform object searching by employing dynamic convolution at multiple scales. Besides, a novel and effective data synthesis method is also proposed to train our network. Experimental results on four benchmark datasets verify the effectiveness of our proposed method. Our code and saliency maps are available at \url{https://github.com/nnizhang/CADC}.