Shizhou Zhang

CV
h-index18
30papers
634citations
Novelty55%
AI Score59

30 Papers

CVAug 17, 2022Code
Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model

Yinghui Xing, Qirui Wu, De Cheng et al.

With the emergence of large pre-trained vison-language model like CLIP, transferable representations can be adapted to a wide range of downstream tasks via prompt tuning. Prompt tuning tries to probe the beneficial information for downstream tasks from the general knowledge stored in the pre-trained model. A recently proposed method named Context Optimization (CoOp) introduces a set of learnable vectors as text prompt from the language side. However, tuning the text prompt alone can only adjust the synthesized "classifier", while the computed visual features of the image encoder can not be affected , thus leading to sub-optimal solutions. In this paper, we propose a novel Dual-modality Prompt Tuning (DPT) paradigm through learning text and visual prompts simultaneously. To make the final image feature concentrate more on the target visual concept, a Class-Aware Visual Prompt Tuning (CAVPT) scheme is further proposed in our DPT, where the class-aware visual prompt is generated dynamically by performing the cross attention between text prompts features and image patch token embeddings to encode both the downstream task-related information and visual instance information. Extensive experimental results on 11 datasets demonstrate the effectiveness and generalization ability of the proposed method. Our code is available in https://github.com/fanrena/DPT.

CVFeb 1, 2023Code
MS-DETR: Multispectral Pedestrian Detection Transformer with Loosely Coupled Fusion and Modality-Balanced Optimization

Yinghui Xing, Shuo Yang, Song Wang et al.

Multispectral pedestrian detection is an important task for many around-the-clock applications, since the visible and thermal modalities can provide complementary information especially under low light conditions. Due to the presence of two modalities, misalignment and modality imbalance are the most significant issues in multispectral pedestrian detection. In this paper, we propose M ulti S pectral pedestrian DE tection TR ansformer (MS-DETR) to fix above issues. MS-DETR consists of two modality-specific backbones and Transformer encoders, followed by a multi-modal Transformer decoder, and the visible and thermal features are fused in the multi-modal Transformer decoder. To well resist the misalignment between multi-modal images, we design a loosely coupled fusion strategy by sparsely sampling some keypoints from multi-modal features independently and fusing them with adaptively learned attention weights. Moreover, based on the insight that not only different modalities, but also different pedestrian instances tend to have different confidence scores to final detection, we further propose an instance-aware modality-balanced optimization strategy, which preserves visible and thermal decoder branches and aligns their predicted slots through an instance-wise dynamic loss. Our end-to-end MS-DETR shows superior performance on the challenging KAIST, CVC-14 and LLVIP benchmark datasets. The source code is available at https://github.com/YinghuiXing/MS-DETR.

CVAug 24, 2023Code
Ground-to-Aerial Person Search: Benchmark Dataset and Approach

Shizhou Zhang, Qingchun Yang, De Cheng et al.

In this work, we construct a large-scale dataset for Ground-to-Aerial Person Search, named G2APS, which contains 31,770 images of 260,559 annotated bounding boxes for 2,644 identities appearing in both of the UAVs and ground surveillance cameras. To our knowledge, this is the first dataset for cross-platform intelligent surveillance applications, where the UAVs could work as a powerful complement for the ground surveillance cameras. To more realistically simulate the actual cross-platform Ground-to-Aerial surveillance scenarios, the surveillance cameras are fixed about 2 meters above the ground, while the UAVs capture videos of persons at different location, with a variety of view-angles, flight attitudes and flight modes. Therefore, the dataset has the following unique characteristics: 1) drastic view-angle changes between query and gallery person images from cross-platform cameras; 2) diverse resolutions, poses and views of the person images under 9 rich real-world scenarios. On basis of the G2APS benchmark dataset, we demonstrate detailed analysis about current two-step and end-to-end person search methods, and further propose a simple yet effective knowledge distillation scheme on the head of the ReID network, which achieves state-of-the-art performances on both of the G2APS and the previous two public person search datasets, i.e., PRW and CUHK-SYSU. The dataset and source code available on \url{https://github.com/yqc123456/HKD_for_person_search}.

CVDec 16, 2022
Weakly Supervised Video Anomaly Detection Based on Cross-Batch Clustering Guidance

Congqi Cao, Xin Zhang, Shizhou Zhang et al.

Weakly supervised video anomaly detection (WSVAD) is a challenging task since only video-level labels are available for training. In previous studies, the discriminative power of the learned features is not strong enough, and the data imbalance resulting from the mini-batch training strategy is ignored. To address these two issues, we propose a novel WSVAD method based on cross-batch clustering guidance. To enhance the discriminative power of features, we propose a batch clustering based loss to encourage a clustering branch to generate distinct normal and abnormal clusters based on a batch of data. Meanwhile, we design a cross-batch learning strategy by introducing clustering results from previous mini-batches to reduce the impact of data imbalance. In addition, we propose to generate more accurate segment-level anomaly scores based on batch clustering guidance further improving the performance of WSVAD. Extensive experiments on two public datasets demonstrate the effectiveness of our approach.

CVDec 4, 2025Code
DuGI-MAE: Improving Infrared Mask Autoencoders via Dual-Domain Guidance

Yinghui Xing, Xiaoting Su, Shizhou Zhang et al.

Infrared imaging plays a critical role in low-light and adverse weather conditions. However, due to the distinct characteristics of infrared images, existing foundation models such as Masked Autoencoder (MAE) trained on visible data perform suboptimal in infrared image interpretation tasks. To bridge this gap, an infrared foundation model known as InfMAE was developed and pre-trained on large-scale infrared datasets. Despite its effectiveness, InfMAE still faces several limitations, including the omission of informative tokens, insufficient modeling of global associations, and neglect of non-uniform noise. In this paper, we propose a Dual-domain Guided Infrared foundation model based on MAE (DuGI-MAE). First, we design a deterministic masking strategy based on token entropy, preserving only high-entropy tokens for reconstruction to enhance informativeness. Next, we introduce a Dual-Domain Guidance (DDG) module, which simultaneously captures global token relationships and adaptively filters non-uniform background noise commonly present in infrared imagery. To facilitate large-scale pretraining, we construct Inf-590K, a comprehensive infrared image dataset encompassing diverse scenes, various target types, and multiple spatial resolutions. Pretrained on Inf-590K, DuGI-MAE demonstrates strong generalization capabilities across various downstream tasks, including infrared object detection, semantic segmentation, and small target detection. Experimental results validate the superiority of the proposed method over both supervised and self-supervised comparison methods. Our code is available in the supplementary material.

CVDec 5, 2022
Generalizable Person Re-Identification via Viewpoint Alignment and Fusion

Bingliang Jiao, Lingqiao Liu, Liying Gao et al.

In the current person Re-identification (ReID) methods, most domain generalization works focus on dealing with style differences between domains while largely ignoring unpredictable camera view change, which we identify as another major factor leading to a poor generalization of ReID methods. To tackle the viewpoint change, this work proposes to use a 3D dense pose estimation model and a texture mapping module to map the pedestrian images to canonical view images. Due to the imperfection of the texture mapping module, the canonical view images may lose the discriminative detail clues from the original images, and thus directly using them for ReID will inevitably result in poor performance. To handle this issue, we propose to fuse the original image and canonical view image via a transformer-based module. The key insight of this design is that the cross-attention mechanism in the transformer could be an ideal solution to align the discriminative texture clues from the original image with the canonical view image, which could compensate for the low-quality texture information of the canonical view image. Through extensive experiments, we show that our method can lead to superior performance over the existing approaches in various evaluation settings.

CVNov 9, 2025Code
On Modality Incomplete Infrared-Visible Object Detection: An Architecture Compatibility Perspective

Shuo Yang, Yinghui Xing, Shizhou Zhang et al.

Infrared and visible object detection (IVOD) is essential for numerous around-the-clock applications. Despite notable advancements, current IVOD models exhibit notable performance declines when confronted with incomplete modality data, particularly if the dominant modality is missing. In this paper, we take a thorough investigation on modality incomplete IVOD problem from an architecture compatibility perspective. Specifically, we propose a plug-and-play Scarf Neck module for DETR variants, which introduces a modality-agnostic deformable attention mechanism to enable the IVOD detector to flexibly adapt to any single or double modalities during training and inference. When training Scarf-DETR, we design a pseudo modality dropout strategy to fully utilize the multi-modality information, making the detector compatible and robust to both working modes of single and double modalities. Moreover, we introduce a comprehensive benchmark for the modality-incomplete IVOD task aimed at thoroughly assessing situations where the absent modality is either dominant or secondary. Our proposed Scarf-DETR not only performs excellently in missing modality scenarios but also achieves superior performances on the standard IVOD modality complete benchmarks. Our code will be available at https://github.com/YinghuiXing/Scarf-DETR.

CVSep 19, 2024
Frequency-Guided Spatial Adaptation for Camouflaged Object Detection

Shizhou Zhang, Dexuan Kong, Yinghui Xing et al.

Camouflaged object detection (COD) aims to segment camouflaged objects which exhibit very similar patterns with the surrounding environment. Recent research works have shown that enhancing the feature representation via the frequency information can greatly alleviate the ambiguity problem between the foreground objects and the background.With the emergence of vision foundation models, like InternImage, Segment Anything Model etc, adapting the pretrained model on COD tasks with a lightweight adapter module shows a novel and promising research direction. Existing adapter modules mainly care about the feature adaptation in the spatial domain. In this paper, we propose a novel frequency-guided spatial adaptation method for COD task. Specifically, we transform the input features of the adapter into frequency domain. By grouping and interacting with frequency components located within non overlapping circles in the spectrogram, different frequency components are dynamically enhanced or weakened, making the intensity of image details and contour features adaptively adjusted. At the same time, the features that are conducive to distinguishing object and background are highlighted, indirectly implying the position and shape of camouflaged object. We conduct extensive experiments on four widely adopted benchmark datasets and the proposed method outperforms 26 state-of-the-art methods with large margins. Code will be released.

CVAug 14, 2024
Cross-Platform Video Person ReID: A New Benchmark Dataset and Adaptation Approach

Shizhou Zhang, Wenlong Luo, De Cheng et al.

In this paper, we construct a large-scale benchmark dataset for Ground-to-Aerial Video-based person Re-Identification, named G2A-VReID, which comprises 185,907 images and 5,576 tracklets, featuring 2,788 distinct identities. To our knowledge, this is the first dataset for video ReID under Ground-to-Aerial scenarios. G2A-VReID dataset has the following characteristics: 1) Drastic view changes; 2) Large number of annotated identities; 3) Rich outdoor scenarios; 4) Huge difference in resolution. Additionally, we propose a new benchmark approach for cross-platform ReID by transforming the cross-platform visual alignment problem into visual-semantic alignment through vision-language model (i.e., CLIP) and applying a parameter-efficient Video Set-Level-Adapter module to adapt image-based foundation model to video ReID tasks, termed VSLA-CLIP. Besides, to further reduce the great discrepancy across the platforms, we also devise the platform-bridge prompts for efficient visual feature alignment. Extensive experiments demonstrate the superiority of the proposed method on all existing video ReID datasets and our proposed G2A-VReID dataset.

CVJul 20, 2023
Pre-train, Adapt and Detect: Multi-Task Adapter Tuning for Camouflaged Object Detection

Yinghui Xing, Dexuan Kong, Shizhou Zhang et al.

Camouflaged object detection (COD), aiming to segment camouflaged objects which exhibit similar patterns with the background, is a challenging task. Most existing works are dedicated to establishing specialized modules to identify camouflaged objects with complete and fine details, while the boundary can not be well located for the lack of object-related semantics. In this paper, we propose a novel ``pre-train, adapt and detect" paradigm to detect camouflaged objects. By introducing a large pre-trained model, abundant knowledge learned from massive multi-modal data can be directly transferred to COD. A lightweight parallel adapter is inserted to adjust the features suitable for the downstream COD task. Extensive experiments on four challenging benchmark datasets demonstrate that our method outperforms existing state-of-the-art COD models by large margins. Moreover, we design a multi-task learning scheme for tuning the adapter to exploit the shareable knowledge across different semantic classes. Comprehensive experimental results showed that the generalization ability of our model can be substantially improved with multi-task adapter initialization on source tasks and multi-task adaptation on target tasks.

CVDec 28, 2025
YOLO-IOD: Towards Real Time Incremental Object Detection

Shizhou Zhang, Xueqiang Lv, Yinghui Xing et al.

Current methods for incremental object detection (IOD) primarily rely on Faster R-CNN or DETR series detectors; however, these approaches do not accommodate the real-time YOLO detection frameworks. In this paper, we first identify three primary types of knowledge conflicts that contribute to catastrophic forgetting in YOLO-based incremental detectors: foreground-background confusion, parameter interference, and misaligned knowledge distillation. Subsequently, we introduce YOLO-IOD, a real-time Incremental Object Detection (IOD) framework that is constructed upon the pretrained YOLO-World model, facilitating incremental learning via a stage-wise parameter-efficient fine-tuning process. Specifically, YOLO-IOD encompasses three principal components: 1) Conflict-Aware Pseudo-Label Refinement (CPR), which mitigates the foreground-background confusion by leveraging the confidence levels of pseudo labels and identifying potential objects relevant to future tasks. 2) Importancebased Kernel Selection (IKS), which identifies and updates the pivotal convolution kernels pertinent to the current task during the current learning stage. 3) Cross-Stage Asymmetric Knowledge Distillation (CAKD), which addresses the misaligned knowledge distillation conflict by transmitting the features of the student target detector through the detection heads of both the previous and current teacher detectors, thereby facilitating asymmetric distillation between existing and newly introduced categories. We further introduce LoCo COCO, a more realistic benchmark that eliminates data leakage across stages. Experiments on both conventional and LoCo COCO benchmarks show that YOLO-IOD achieves superior performance with minimal forgetting.

CVJan 10, 2024Code
CrossDiff: Exploring Self-Supervised Representation of Pansharpening via Cross-Predictive Diffusion Model

Yinghui Xing, Litao Qu, Shizhou Zhang et al.

Fusion of a panchromatic (PAN) image and corresponding multispectral (MS) image is also known as pansharpening, which aims to combine abundant spatial details of PAN and spectral information of MS. Due to the absence of high-resolution MS images, available deep-learning-based methods usually follow the paradigm of training at reduced resolution and testing at both reduced and full resolution. When taking original MS and PAN images as inputs, they always obtain sub-optimal results due to the scale variation. In this paper, we propose to explore the self-supervised representation of pansharpening by designing a cross-predictive diffusion model, named CrossDiff. It has two-stage training. In the first stage, we introduce a cross-predictive pretext task to pre-train the UNet structure based on conditional DDPM, while in the second stage, the encoders of the UNets are frozen to directly extract spatial and spectral features from PAN and MS, and only the fusion head is trained to adapt for pansharpening task. Extensive experiments show the effectiveness and superiority of the proposed model compared with state-of-the-art supervised and unsupervised methods. Besides, the cross-sensor experiments also verify the generalization ability of proposed self-supervised representation learners for other satellite's datasets. We will release our code for reproducibility.

CVMar 2
Better Matching, Less Forgetting: A Quality-Guided Matcher for Transformer-based Incremental Object Detection

Qirui Wu, Shizhou Zhang, De Cheng et al.

Incremental Object Detection (IOD) aims to continuously learn new object classes without forgetting previously learned ones. A persistent challenge is catastrophic forgetting, primarily attributed to background shift in conventional detectors. While pseudo-labeling mitigates this in dense detectors, we identify a novel, distinct source of forgetting specific to DETR-like architectures: background foregrounding. This arises from the exhaustiveness constraint of the Hungarian matcher, which forcibly assigns every ground truth target to one prediction, even when predictions primarily cover background regions (i.e., low IoU). This erroneous supervision compels the model to misclassify background features as specific foreground classes, disrupting learned representations and accelerating forgetting. To address this, we propose a Quality-guided Min-Cost Max-Flow (Q-MCMF) matcher. To avoid forced assignments, Q-MCMF builds a flow graph and prunes implausible matches based on geometric quality. It then optimizes for the final matching that minimizes cost and maximizes valid assignments. This strategy eliminates harmful supervision from background foregrounding while maximizing foreground learning signals. Extensive experiments on the COCO dataset under various incremental settings demonstrate that our method consistently outperforms existing state-of-the-art approaches.

CVDec 17, 2024Code
ComprehendEdit: A Comprehensive Dataset and Evaluation Framework for Multimodal Knowledge Editing

Yaohui Ma, Xiaopeng Hong, Shizhou Zhang et al.

Large multimodal language models (MLLMs) have revolutionized natural language processing and visual understanding, but often contain outdated or inaccurate information. Current multimodal knowledge editing evaluations are limited in scope and potentially biased, focusing on narrow tasks and failing to assess the impact on in-domain samples. To address these issues, we introduce ComprehendEdit, a comprehensive benchmark comprising eight diverse tasks from multiple datasets. We propose two novel metrics: Knowledge Generalization Index (KGI) and Knowledge Preservation Index (KPI), which evaluate editing effects on in-domain samples without relying on AI-synthetic samples. Based on insights from our framework, we establish Hierarchical In-Context Editing (HICE), a baseline method employing a two-stage approach that balances performance across all metrics. This study provides a more comprehensive evaluation framework for multimodal knowledge editing, reveals unique challenges in this field, and offers a baseline method demonstrating improved performance. Our work opens new perspectives for future research and provides a foundation for developing more robust and effective editing techniques for MLLMs. The ComprehendEdit benchmark and implementation code are available at https://github.com/yaohui120/ComprehendEdit.

CVFeb 24
Knowing the Unknown: Interpretable Open-World Object Detection via Concept Decomposition Model

Xueqiang Lv, Shizhou Zhang, Yinghui Xing et al.

Open-world object detection (OWOD) requires incrementally detecting known categories while reliably identifying unknown objects. Existing methods primarily focus on improving unknown recall, yet overlook interpretability, often leading to known-unknown confusion and reduced prediction reliability. This paper aims to make the entire OWOD framework interpretable, enabling the detector to truly "knowing the unknown". To this end, we propose a concept-driven InterPretable OWOD framework(IPOW) by introducing a Concept Decomposition Model (CDM) for OWOD, which explicitly decomposes the coupled RoI features in Faster R-CNN into discriminative, shared, and background concepts. Discriminative concepts identify the most discriminative features to enlarge the distances between known categories, while shared and background concepts, due to their strong generalization ability, can be readily transferred to detect unknown categories. Leveraging the interpretable framework, we identify that known-unknown confusion arises when unknown objects fall into the discriminative space of known classes. To address this, we propose Concept-Guided Rectification (CGR) to further resolve such confusion. Extensive experiments show that IPOW significantly improves unknown recall while mitigating confusion, and provides concept-level interpretability for both known and unknown predictions.

CVFeb 8, 2025Code
Demystifying Catastrophic Forgetting in Two-Stage Incremental Object Detector

Qirui Wu, Shizhou Zhang, De Cheng et al.

Catastrophic forgetting is a critical chanllenge for incremental object detection (IOD). Most existing methods treat the detector monolithically, relying on instance replay or knowledge distillation without analyzing component-specific forgetting. Through dissection of Faster R-CNN, we reveal a key insight: Catastrophic forgetting is predominantly localized to the RoI Head classifier, while regressors retain robustness across incremental stages. This finding challenges conventional assumptions, motivating us to develop a framework termed NSGP-RePRE. Regional Prototype Replay (RePRE) mitigates classifier forgetting via replay of two types of prototypes: coarse prototypes represent class-wise semantic centers of RoI features, while fine-grained prototypes model intra-class variations. Null Space Gradient Projection (NSGP) is further introduced to eliminate prototype-feature misalignment by updating the feature extractor in directions orthogonal to subspace of old inputs via gradient projection, aligning RePRE with incremental learning dynamics. Our simple yet effective design allows NSGP-RePRE to achieve state-of-the-art performance on the Pascal VOC and MS COCO datasets under various settings. Our work not only advances IOD methodology but also provide pivotal insights for catastrophic forgetting mitigation in IOD. Code is available at \href{https://github.com/fanrena/NSGP-RePRE}{https://github.com/fanrena/NSGP-RePRE} .

CVNov 23, 2024Code
Mamba-CL: Optimizing Selective State Space Model in Null Space for Continual Learning

De Cheng, Yue Lu, Lingfeng He et al.

Continual Learning (CL) aims to equip AI models with the ability to learn a sequence of tasks over time, without forgetting previously learned knowledge. Recently, State Space Models (SSMs), particularly the Mamba model, have achieved notable success in computer vision. Building on the strengths of SSMs, this study explores leveraging the Mamba model for CL. Therefore, we introduce Mamba-CL, a framework that continuously fine-tunes the core SSMs of the large-scale Mamba foundation model by updating parameters orthogonal to the feature subspace of previous tasks. This approach theoretically guarantees the consistency objective aiming to preserves consistent output for each SSM module across both previous and current tasks, so as to overcome catastrophic forgetting issue. Specifically, we achieve this goal by deducing the overall consistency constraints on four key time-invariant parameters in the Mamba model, streamlining its recurrent state-space structure and non-linear discretization process in SSM. In practice, we apply the null-space projection to efficiently implement the orthogonality within Mamba model. Extensive experiments on four class-incremental benchmarks demonstrate the effectiveness of Mamba-CL for anti-forgetting, achieving superior performances to state-of-the-art methods. Code is available in the supplementary materials.

CVFeb 5
Attention Retention for Continual Learning with Vision Transformers

Yue Lu, Xiangyu Zhou, Shizhou Zhang et al.

Continual learning (CL) empowers AI systems to progressively acquire knowledge from non-stationary data streams. However, catastrophic forgetting remains a critical challenge. In this work, we identify attention drift in Vision Transformers as a primary source of catastrophic forgetting, where the attention to previously learned visual concepts shifts significantly after learning new tasks. Inspired by neuroscientific insights into the selective attention in the human visual system, we propose a novel attention-retaining framework to mitigate forgetting in CL. Our method constrains attention drift by explicitly modifying gradients during backpropagation through a two-step process: 1) extracting attention maps of the previous task using a layer-wise rollout mechanism and generating instance-adaptive binary masks, and 2) when learning a new task, applying these masks to zero out gradients associated with previous attention regions, thereby preventing disruption of learned visual concepts. For compatibility with modern optimizers, the gradient masking process is further enhanced by scaling parameter updates proportionally to maintain their relative magnitudes. Experiments and visualizations demonstrate the effectiveness of our method in mitigating catastrophic forgetting and preserving visual concepts. It achieves state-of-the-art performance and exhibits robust generalizability across diverse CL scenarios.

CVJun 9, 2024Code
Visual Prompt Tuning in Null Space for Continual Learning

Yue Lu, Shizhou Zhang, De Cheng et al.

Existing prompt-tuning methods have demonstrated impressive performances in continual learning (CL), by selecting and updating relevant prompts in the vision-transformer models. On the contrary, this paper aims to learn each task by tuning the prompts in the direction orthogonal to the subspace spanned by previous tasks' features, so as to ensure no interference on tasks that have been learned to overcome catastrophic forgetting in CL. However, different from the orthogonal projection in the traditional CNN architecture, the prompt gradient orthogonal projection in the ViT architecture shows completely different and greater challenges, i.e., 1) the high-order and non-linear self-attention operation; 2) the drift of prompt distribution brought by the LayerNorm in the transformer block. Theoretically, we have finally deduced two consistency conditions to achieve the prompt gradient orthogonal projection, which provide a theoretical guarantee of eliminating interference on previously learned knowledge via the self-attention mechanism in visual prompt tuning. In practice, an effective null-space-based approximation solution has been proposed to implement the prompt gradient orthogonal projection. Extensive experimental results demonstrate the effectiveness of anti-forgetting on four class-incremental benchmarks with diverse pre-trained baseline models, and our approach achieves superior performances to state-of-the-art methods. Our code is available at https://github.com/zugexiaodui/VPTinNSforCL.

CVAug 14, 2019Code
Person Re-identification in Aerial Imagery

Shizhou Zhang, Qi Zhang, Yifei Yang et al.

Nowadays, with the rapid development of consumer Unmanned Aerial Vehicles (UAVs), visual surveillance by utilizing the UAV platform has been very attractive. Most of the research works for UAV captured visual data are mainly focused on the tasks of object detection and tracking. However, limited attention has been paid to the task of person Re-identification (ReID) which has been widely studied in ordinary surveillance cameras with fixed emplacements. In this paper, to facilitate the research of person ReID in aerial imagery, we collect a large scale airborne person ReID dataset named as Person ReID for Aerial Imagery (PRAI-1581), which consists of 39,461 images of 1581 person identities. The images of the dataset are shot by two DJI consumer UAVs flying at an altitude ranging from 20 to 60 meters above the ground, which covers most of the real UAV surveillance scenarios. In addition, we propose to utilize subspace pooling of convolution feature maps to represent the input person images. Our method can learn a discriminative and compact feature representation for ReID in aerial imagery and can be trained in an end-to-end fashion efficiently. We conduct extensive experiments on the proposed dataset and the experimental results demonstrate that re-identify persons in aerial imagery is a challenging problem, where our method performs favorably against state of the arts. Our dataset can be accessed via \url{https://github.com/stormyoung/PRAI-1581}.

LGAug 12, 2025
Multi-level Collaborative Distillation Meets Global Workspace Model: A Unified Framework for OCIL

Shibin Su, Guoqiang Liang, De Cheng et al.

Online Class-Incremental Learning (OCIL) enables models to learn continuously from non-i.i.d. data streams and samples of the data streams can be seen only once, making it more suitable for real-world scenarios compared to offline learning. However, OCIL faces two key challenges: maintaining model stability under strict memory constraints and ensuring adaptability to new tasks. Under stricter memory constraints, current replay-based methods are less effective. While ensemble methods improve adaptability (plasticity), they often struggle with stability. To overcome these challenges, we propose a novel approach that enhances ensemble learning through a Global Workspace Model (GWM)-a shared, implicit memory that guides the learning of multiple student models. The GWM is formed by fusing the parameters of all students within each training batch, capturing the historical learning trajectory and serving as a dynamic anchor for knowledge consolidation. This fused model is then redistributed periodically to the students to stabilize learning and promote cross-task consistency. In addition, we introduce a multi-level collaborative distillation mechanism. This approach enforces peer-to-peer consistency among students and preserves historical knowledge by aligning each student with the GWM. As a result, student models remain adaptable to new tasks while maintaining previously learned knowledge, striking a better balance between stability and plasticity. Extensive experiments on three standard OCIL benchmarks show that our method delivers significant performance improvement for several OCIL models across various memory budgets.

CVMay 29, 2025
Adaptive Spatial Augmentation for Semi-supervised Semantic Segmentation

Lingyan Ran, Yali Li, Tao Zhuo et al.

In semi-supervised semantic segmentation (SSSS), data augmentation plays a crucial role in the weak-to-strong consistency regularization framework, as it enhances diversity and improves model generalization. Recent strong augmentation methods have primarily focused on intensity-based perturbations, which have minimal impact on the semantic masks. In contrast, spatial augmentations like translation and rotation have long been acknowledged for their effectiveness in supervised semantic segmentation tasks, but they are often ignored in SSSS. In this work, we demonstrate that spatial augmentation can also contribute to model training in SSSS, despite generating inconsistent masks between the weak and strong augmentations. Furthermore, recognizing the variability among images, we propose an adaptive augmentation strategy that dynamically adjusts the augmentation for each instance based on entropy. Extensive experiments show that our proposed Adaptive Spatial Augmentation (\textbf{ASAug}) can be integrated as a pluggable module, consistently improving the performance of existing methods and achieving state-of-the-art results on benchmark datasets such as PASCAL VOC 2012, Cityscapes, and COCO.

CVApr 29, 2024
Dual-Modal Prompting for Sketch-Based Image Retrieval

Liying Gao, Bingliang Jiao, Peng Wang et al.

Sketch-based image retrieval (SBIR) associates hand-drawn sketches with their corresponding realistic images. In this study, we aim to tackle two major challenges of this task simultaneously: i) zero-shot, dealing with unseen categories, and ii) fine-grained, referring to intra-category instance-level retrieval. Our key innovation lies in the realization that solely addressing this cross-category and fine-grained recognition task from the generalization perspective may be inadequate since the knowledge accumulated from limited seen categories might not be fully valuable or transferable to unseen target categories. Inspired by this, in this work, we propose a dual-modal prompting CLIP (DP-CLIP) network, in which an adaptive prompting strategy is designed. Specifically, to facilitate the adaptation of our DP-CLIP toward unpredictable target categories, we employ a set of images within the target category and the textual category label to respectively construct a set of category-adaptive prompt tokens and channel scales. By integrating the generated guidance, DP-CLIP could gain valuable category-centric insights, efficiently adapting to novel categories and capturing unique discriminative clues for effective retrieval within each target category. With these designs, our DP-CLIP outperforms the state-of-the-art fine-grained zero-shot SBIR method by 7.3% in Acc.@1 on the Sketchy dataset. Meanwhile, in the other two category-level zero-shot SBIR benchmarks, our method also achieves promising performance.

CVFeb 7, 2024
DMAT: A Dynamic Mask-Aware Transformer for Human De-occlusion

Guoqiang Liang, Jiahao Hu, Qingyue Wang et al.

Human de-occlusion, which aims to infer the appearance of invisible human parts from an occluded image, has great value in many human-related tasks, such as person re-id, and intention inference. To address this task, this paper proposes a dynamic mask-aware transformer (DMAT), which dynamically augments information from human regions and weakens that from occlusion. First, to enhance token representation, we design an expanded convolution head with enlarged kernels, which captures more local valid context and mitigates the influence of surrounding occlusion. To concentrate on the visible human parts, we propose a novel dynamic multi-head human-mask guided attention mechanism through integrating multiple masks, which can prevent the de-occluded regions from assimilating to the background. Besides, a region upsampling strategy is utilized to alleviate the impact of occlusion on interpolated images. During model learning, an amodal loss is developed to further emphasize the recovery effect of human regions, which also refines the model's convergence. Extensive experiments on the AHP dataset demonstrate its superior performance compared to recent state-of-the-art methods.

CVMay 22, 2023
Efficient Bilateral Cross-Modality Cluster Matching for Unsupervised Visible-Infrared Person ReID

De Cheng, Lingfeng He, Nannan Wang et al.

Unsupervised visible-infrared person re-identification (USL-VI-ReID) aims to match pedestrian images of the same identity from different modalities without annotations. Existing works mainly focus on alleviating the modality gap by aligning instance-level features of the unlabeled samples. However, the relationships between cross-modality clusters are not well explored. To this end, we propose a novel bilateral cluster matching-based learning framework to reduce the modality gap by matching cross-modality clusters. Specifically, we design a Many-to-many Bilateral Cross-Modality Cluster Matching (MBCCM) algorithm through optimizing the maximum matching problem in a bipartite graph. Then, the matched pairwise clusters utilize shared visible and infrared pseudo-labels during the model training. Under such a supervisory signal, a Modality-Specific and Modality-Agnostic (MSMA) contrastive learning framework is proposed to align features jointly at a cluster-level. Meanwhile, the cross-modality Consistency Constraint (CC) is proposed to explicitly reduce the large modality discrepancy. Extensive experiments on the public SYSU-MM01 and RegDB datasets demonstrate the effectiveness of the proposed method, surpassing state-of-the-art approaches by a large margin of 8.76% mAP on average.

CVFeb 14, 2022
Adaptive Graph Convolutional Networks for Weakly Supervised Anomaly Detection in Videos

Congqi Cao, Xin Zhang, Shizhou Zhang et al.

For weakly supervised anomaly detection, most existing work is limited to the problem of inadequate video representation due to the inability of modeling long-term contextual information. To solve this, we propose a novel weakly supervised adaptive graph convolutional network (WAGCN) to model the complex contextual relationship among video segments. By which, we fully consider the influence of other video segments on the current one when generating the anomaly probability score for each segment. Firstly, we combine the temporal consistency as well as feature similarity of video segments to construct a global graph, which makes full use of the association information among spatial-temporal features of anomalous events in videos. Secondly, we propose a graph learning layer in order to break the limitation of setting topology manually, which can extract graph adjacency matrix based on data adaptively and effectively. Extensive experiments on two public datasets (i.e., UCF-Crime dataset and ShanghaiTech dataset) demonstrate the effectiveness of our approach which achieves state-of-the-art performance.

CVSep 27, 2021
Text-based Person Search in Full Images via Semantic-Driven Proposal Generation

Shizhou Zhang, De Cheng, Wenlong Luo et al.

Finding target persons in full scene images with a query of text description has important practical applications in intelligent video surveillance.However, different from the real-world scenarios where the bounding boxes are not available, existing text-based person retrieval methods mainly focus on the cross modal matching between the query text descriptions and the gallery of cropped pedestrian images. To close the gap, we study the problem of text-based person search in full images by proposing a new end-to-end learning framework which jointly optimize the pedestrian detection, identification and visual-semantic feature embedding tasks. To take full advantage of the query text, the semantic features are leveraged to instruct the Region Proposal Network to pay more attention to the text-described proposals. Besides, a cross-scale visual-semantic embedding mechanism is utilized to improve the performance. To validate the proposed method, we collect and annotate two large-scale benchmark datasets based on the widely adopted image-based person search datasets CUHK-SYSU and PRW. Comprehensive experiments are conducted on the two datasets and compared with the baseline methods, our method achieves the state-of-the-art performance.

CVMay 24, 2021
Unsupervised Video Summarization with a Convolutional Attentive Adversarial Network

Guoqiang Liang, Yanbing Lv, Shucheng Li et al.

With the explosive growth of video data, video summarization, which attempts to seek the minimum subset of frames while still conveying the main story, has become one of the hottest topics. Nowadays, substantial achievements have been made by supervised learning techniques, especially after the emergence of deep learning. However, it is extremely expensive and difficult to collect human annotation for large-scale video datasets. To address this problem, we propose a convolutional attentive adversarial network (CAAN), whose key idea is to build a deep summarizer in an unsupervised way. Upon the generative adversarial network, our overall framework consists of a generator and a discriminator. The former predicts importance scores for all frames of a video while the latter tries to distinguish the score-weighted frame features from original frame features. Specifically, the generator employs a fully convolutional sequence network to extract global representation of a video, and an attention-based network to output normalized importance scores. To learn the parameters, our objective function is composed of three loss functions, which can guide the frame-level importance score prediction collaboratively. To validate this proposed method, we have conducted extensive experiments on two public benchmarks SumMe and TVSum. The results show the superiority of our proposed method against other state-of-the-art unsupervised approaches. Our method even outperforms some published supervised approaches.

CVOct 25, 2019
Attend to the Difference: Cross-Modality Person Re-identification via Contrastive Correlation

Shizhou Zhang, Yifei Yang, Peng Wang et al.

The problem of cross-modality person re-identification has been receiving increasing attention recently, due to its practical significance. Motivated by the fact that human usually attend to the difference when they compare two similar objects, we propose a dual-path cross-modality feature learning framework which preserves intrinsic spatial strictures and attends to the difference of input cross-modality image pairs. Our framework is composed by two main components: a Dual-path Spatial-structure-preserving Common Space Network (DSCSN) and a Contrastive Correlation Network (CCN). The former embeds cross-modality images into a common 3D tensor space without losing spatial structures, while the latter extracts contrastive features by dynamically comparing input image pairs. Note that the representations generated for the input RGB and Infrared images are mutually dependant to each other. We conduct extensive experiments on two public available RGB-IR ReID datasets, SYSU-MM01 and RegDB, and our proposed method outperforms state-of-the-art algorithms by a large margin with both full and simplified evaluation modes.

CVApr 2, 2019
Vehicle Re-identification in Aerial Imagery: Dataset and Approach

Peng Wang, Bingliang Jiao, Lu Yang et al.

In this work, we construct a large-scale dataset for vehicle re-identification (ReID), which contains 137k images of 13k vehicle instances captured by UAV-mounted cameras. To our knowledge, it is the largest UAV-based vehicle ReID dataset. To increase intra-class variation, each vehicle is captured by at least two UAVs at different locations, with diverse view-angles and flight-altitudes. We manually label a variety of vehicle attributes, including vehicle type, color, skylight, bumper, spare tire and luggage rack. Furthermore, for each vehicle image, the annotator is also required to mark the discriminative parts that helps them to distinguish this particular vehicle from others. Besides the dataset, we also design a specific vehicle ReID algorithm to make full use of the rich annotation information. It is capable of explicitly detecting discriminative parts for each specific vehicle and significantly outperforms the evaluated baselines and state-of-the-art vehicle ReID approaches.