CV · Mar 25, 2023 · Code
Diverse Embedding Expansion Network and Low-Light Cross-Modality Benchmark for Visible-Infrared Person Re-identification
Yukang Zhang, Hanzi Wang
For the visible-infrared person re-identification (VIReID) task, one of the major challenges is the modality gap between visible (VIS) and infrared (IR) images. However, training samples are usually limited while the modality gap is large, so existing methods cannot effectively mine diverse cross-modality clues. To handle this limitation, we propose a novel augmentation network in the embedding space, called diverse embedding expansion network (DEEN). The proposed DEEN can effectively generate diverse embeddings to learn informative feature representations and reduce the modality discrepancy between the VIS and IR images. Moreover, the VIReID model may be seriously affected by drastic illumination changes, while all the existing VIReID datasets are captured under sufficient illumination without significant light changes. Thus, we provide a low-light cross-modality (LLCM) dataset, which contains 46,767 bounding boxes of 1,064 identities captured by 9 RGB/IR cameras. Extensive experiments on the SYSU-MM01, RegDB and LLCM datasets show the superiority of the proposed DEEN over several other state-of-the-art methods. The code and dataset are released at: https://github.com/ZYK100/LLCM
CV · Apr 14, 2023 · Code
PARFormer: Transformer-based Multi-Task Network for Pedestrian Attribute Recognition
Xinwen Fan, Yukang Zhang, Yang Lu et al.
Pedestrian attribute recognition (PAR) has received increasing attention because of its wide application in video surveillance and pedestrian analysis. Extracting robust feature representation is one of the key challenges in this task. The existing methods mainly use the convolutional neural network (CNN) as the backbone network to extract features. However, these methods mainly focus on small discriminative regions while ignoring the global perspective. To overcome these limitations, we propose a pure transformer-based multi-task PAR network named PARFormer, which includes four modules. In the feature extraction module, we build a transformer-based strong baseline for feature extraction, which achieves competitive results on several PAR benchmarks compared with the existing CNN-based baseline methods. In the feature processing module, we propose an effective data augmentation strategy named batch random mask (BRM) block to reinforce the attentive feature learning of random patches. Furthermore, we propose a multi-attribute center loss (MACL) to enhance the inter-attribute discriminability in the feature representations. In the viewpoint perception module, we explore the impact of viewpoints on pedestrian attributes, and propose a multi-view contrastive loss (MCVL) that enables the network to exploit the viewpoint information. In the attribute recognition module, we alleviate the negative-positive imbalance problem to generate the attribute predictions. The above modules interact and jointly learn a highly discriminative feature space, and supervise the generation of the final features. Extensive experimental results show that the proposed PARFormer network performs well compared to the state-of-the-art methods on several public datasets, including PETA, RAP, and PA100K. Code will be released at https://github.com/xwf199/PARFormer.
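The abstract does not specify how the batch random mask (BRM) block selects patches, so the following is only a minimal sketch of one plausible reading: the same randomly chosen patch positions are zeroed for every sample in a batch. The function name `batch_random_mask`, the list-of-lists token layout, and the zero-fill choice are all assumptions made for illustration, not the authors' implementation.

```python
import random


def batch_random_mask(tokens, num_masked, seed=None):
    """Zero out the same randomly chosen patch positions for every sample.

    tokens: list of samples, each a list of patch vectors (lists of floats).
    Returns (masked_tokens, masked_indices); the input is left unmodified.
    Assumption: sharing one mask across the batch is our reading of "batch".
    """
    num_patches = len(tokens[0])
    rng = random.Random(seed)
    masked = set(rng.sample(range(num_patches), num_masked))
    out = [
        [[0.0] * len(patch) if i in masked else list(patch)
         for i, patch in enumerate(sample)]
        for sample in tokens
    ]
    return out, sorted(masked)


# Toy batch: 3 samples, 6 patches each, 2-dimensional patch embeddings.
batch = [[[float(b + p + 1)] * 2 for p in range(6)] for b in range(3)]
masked_batch, masked_idx = batch_random_mask(batch, num_masked=2, seed=0)
```

In a real transformer pipeline the same idea would be applied to patch-embedding tensors before the encoder; the masking ratio and fill value would be tuning choices.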
LG · Mar 1, 2023
OmniForce: On Human-Centered, Large Model Empowered and Cloud-Edge Collaborative AutoML System
Chao Xue, Wei Liu, Shuai Xie et al.
Automated machine learning (AutoML) seeks to build ML models with minimal human effort. While considerable research has been conducted in the area of AutoML in general, aiming to take humans out of the loop when building artificial intelligence (AI) applications, scant literature has focused on how AutoML works well in open-environment scenarios such as the process of training and updating large models, industrial supply chains or the industrial metaverse, where people often face open-loop problems during the search process: they must continuously collect data, update data and models, satisfy the requirements of the development and deployment environment, support massive devices, modify evaluation metrics, etc. Addressing the open-environment issue with pure data-driven approaches requires considerable data, computing resources, and effort from dedicated data engineers, making current AutoML systems and platforms inefficient and computationally intractable. Human-computer interaction is a practical and feasible way to tackle the problem of open-environment AI. In this paper, we introduce OmniForce, a human-centered AutoML (HAML) system that yields both human-assisted ML and ML-assisted human techniques, to put an AutoML system into practice and build adaptive AI in open-environment scenarios. Specifically, we present OmniForce in terms of ML version management; pipeline-driven development and deployment collaborations; a flexible search strategy framework; and widely provisioned and crowdsourced application algorithms, including large models. Furthermore, the (large) models constructed by OmniForce can be automatically turned into remote services in a few minutes; this process is dubbed model as a service (MaaS). Experimental results obtained in multiple search spaces and real-world use cases demonstrate the efficacy and efficiency of OmniForce.
CV · Mar 26, 2023
MRCN: A Novel Modality Restitution and Compensation Network for Visible-Infrared Person Re-identification
Yukang Zhang, Yan Yan, Jie Li et al.
Visible-infrared person re-identification (VI-ReID), which aims to search identities across different spectra, is a challenging task due to the large cross-modality discrepancy between visible and infrared images. The key to reducing the discrepancy is to filter out identity-irrelevant interference and effectively learn modality-invariant person representations. In this paper, we propose a novel Modality Restitution and Compensation Network (MRCN) to narrow the gap between the two modalities. Specifically, we first reduce the modality discrepancy by using two Instance Normalization (IN) layers. Next, to reduce the influence of IN layers on removing discriminative information and to reduce modality differences, we propose a Modality Restitution Module (MRM) and a Modality Compensation Module (MCM) to respectively distill modality-irrelevant and modality-relevant features from the removed information. Then, the modality-irrelevant features are used to restitute the normalized visible and infrared features, while the modality-relevant features are used to compensate for the features of the other modality. Furthermore, to better disentangle the modality-relevant features and the modality-irrelevant features, we propose a novel Center-Quadruplet Causal (CQC) loss to encourage the network to effectively learn the modality-relevant features and the modality-irrelevant features. Extensive experiments are conducted to validate the superiority of our method on the challenging SYSU-MM01 and RegDB datasets. More remarkably, our method achieves 95.1% in terms of Rank-1 and 89.2% in terms of mAP on the RegDB dataset.
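To make the "removed information" concrete, here is a minimal sketch of instance normalization on a single feature channel, together with the residual that normalization discards. The function name `instance_norm` and the toy data are illustrative assumptions; the MRM and MCM modules that would distill features from this residual are not reproduced here.

```python
from statistics import fmean


def instance_norm(channel, eps=1e-5):
    """Normalize one channel of one sample to zero mean and (near) unit variance.

    Returns (normalized, residual), where residual = input - normalized is
    the information that normalization removed from the channel.
    """
    mean = fmean(channel)
    var = fmean((v - mean) ** 2 for v in channel)
    std = (var + eps) ** 0.5
    normalized = [(v - mean) / std for v in channel]
    residual = [v - n for v, n in zip(channel, normalized)]
    return normalized, residual


channel = [1.0, 2.0, 3.0, 4.0]
normalized, residual = instance_norm(channel)
# Adding the residual back recovers the original channel exactly (up to
# floating-point error), which is why it is a natural source for restitution.
restored = [n + r for n, r in zip(normalized, residual)]
```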
CV · Feb 2, 2023
Exploring Invariant Representation for Visible-Infrared Person Re-Identification
Lei Tan, Yukang Zhang, Shengmei Shen et al.
Cross-spectral person re-identification, which aims to associate identities with pedestrians across different spectra, faces the main challenge of modality discrepancy. In this paper, we address the problem at both the image level and the feature level in an end-to-end hybrid learning framework named robust feature mining network (RFM). In particular, we observe that the reflective intensity of the same surface in photos shot at different wavelengths could be transformed using a linear model. Besides, we show that the variable linear factor across different surfaces is the main culprit behind the modality discrepancy. We integrate this reflection observation into an image-level data augmentation by proposing the linear transformation generator (LTG). Moreover, at the feature level, we introduce a cross-center loss to explore a more compact intra-class distribution and modality-aware spatial attention to take advantage of textured regions more efficiently. Experimental results on two standard cross-spectral person re-identification datasets, i.e., RegDB and SYSU-MM01, have demonstrated state-of-the-art performance.
RO · May 26
Trust, Geometry, and Rules: A Credibility-Aware Reinforcement Learning Framework for Safe USV Navigation under Uncertainty
Yuhang Zhang, Shuqi Chai, Yukang Zhang et al.
Autonomous navigation of Unmanned Surface Vehicles (USVs) that is safe and compliant with the International Regulations for Preventing Collisions at Sea (COLREGs) remains a formidable challenge in dynamic maritime environments, particularly when perception systems exhibit miscalibrated uncertainty. Existing Reinforcement Learning (RL)-based methods often falter because state-estimation errors induce unreliable belief states that mislead the value function, while discrete traffic rules introduce discontinuity in the learning objective. To address these challenges, we propose a framework integrating credibility-aware learning, geometric safety shielding, and continuous rule-aware embedding. First, Credibility-Weighted Value Learning (CW-VL) introduces a dynamic trust factor derived from the discrepancy between filter-estimated covariance and empirical error statistics to modulate the critic's heteroscedastic loss, preventing policy overfitting to noisy samples. Second, the Covariance-Inflated Velocity Obstacle (CI-VO) maps position-estimation uncertainty into set-wise angular margins, forming a conservative geometric shield that overrides hazardous exploratory actions. Third, Risk-Aware COLREGs Duty Embedding relaxes binary encounter duties into continuous rule-aware signals, providing smooth sector-transition information and suppressing oscillation from sparse rule rewards. Simulated encounter studies demonstrate improved training robustness against perceptual inconsistency and superior collision avoidance and COLREGs compliance over baselines.
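The geometric idea behind a covariance-inflated velocity obstacle can be sketched as a classic collision-cone test whose safety radius grows with the position-estimation uncertainty. The code below is a simplified 2-D illustration under stated assumptions: an isotropic standard deviation `sigma`, an inflation gain `k`, and the function names are all hypothetical and are not taken from the paper's CI-VO formulation.

```python
import math


def is_velocity_unsafe(rel_pos, rel_vel, combined_radius, sigma=0.0, k=2.0):
    """Collision-cone test with an uncertainty-inflated safety radius.

    rel_pos: obstacle position minus own position, (x, y).
    rel_vel: own velocity minus obstacle velocity, (vx, vy).
    sigma:   assumed isotropic std. dev. of the position estimate.
    k:       assumed inflation gain (how many sigmas of margin to add).
    Returns True if rel_vel points inside the inflated collision cone.
    """
    inflated = combined_radius + k * sigma
    distance = math.hypot(*rel_pos)
    if distance <= inflated:
        return True  # already inside the inflated safety disc
    speed = math.hypot(*rel_vel)
    if speed == 0.0:
        return False  # no relative motion, so no approach
    half_angle = math.asin(inflated / distance)
    # Angle between the relative velocity and the line of sight.
    dot = rel_pos[0] * rel_vel[0] + rel_pos[1] * rel_vel[1]
    cos_angle = max(-1.0, min(1.0, dot / (distance * speed)))
    return math.acos(cos_angle) <= half_angle


# The same manoeuvre is accepted with a confident estimate but rejected
# once the uncertainty widens the cone.
confident = is_velocity_unsafe((10.0, 0.0), (1.0, 0.2), 1.0, sigma=0.0)
uncertain = is_velocity_unsafe((10.0, 0.0), (1.0, 0.2), 1.0, sigma=1.0)
```

A shield built on such a test would replace any action flagged as unsafe with a fallback action; how the paper selects that override is not described in the abstract.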
SY · Apr 15
Cascaded TD3-PID Hybrid Controller for Quadrotor Trajectory Tracking in Wind Disturbance Environments
Yukang Zhang, Shuqi Chai, Yuhang Zhang et al.
This work presents a cascaded hybrid control framework for quadrotor trajectory tracking under nonlinear dynamics and external disturbances. In quadrotor systems, the altitude and attitude channels exhibit fast, structured dynamics that are well suited to reliable regulation, whereas horizontal-position control is more strongly affected by coupling effects, uncertainty, and disturbances, so that neither pure feedback control nor purely learning-based control alone is equally well suited to all channels. Accordingly, the proposed framework augments conventional proportional-integral-derivative (PID) stabilization for altitude and attitude control with an enhanced Twin Delayed Deep Deterministic Policy Gradient (TD3) agent incorporating a multi-Q-network structure, thereby improving horizontal-position control under severe disturbances. To further strengthen disturbance rejection in altitude and attitude control, a hybrid disturbance observer (HDOB) using low-pass and exponential moving average filtering is embedded in the control loops. The proposed TD3 enhancements are verified through ablation studies, and both numerical simulations and real-world flight tests on the quadrotor platform demonstrate that the proposed method achieves more accurate and robust trajectory tracking under wind disturbances than baseline approaches.
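As a rough illustration of the PID-plus-observer part of such a design, the sketch below pairs a textbook PID loop with a disturbance estimate smoothed by an exponential moving average, on a toy first-order plant. The plant model, gains, `EMAFilter`, and `simulate` are assumptions chosen for illustration; this is a simplified stand-in for the paper's hybrid disturbance observer and does not include the TD3 agent.

```python
class EMAFilter:
    """Exponential moving average: y <- alpha * x + (1 - alpha) * y."""

    def __init__(self, alpha):
        self.alpha = alpha
        self.value = None

    def update(self, sample):
        if self.value is None:
            self.value = sample  # initialise with the first sample
        else:
            self.value = self.alpha * sample + (1.0 - self.alpha) * self.value
        return self.value


class PID:
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = None

    def step(self, error):
        self.integral += error * self.dt
        # Skip the derivative on the first call to avoid a derivative kick.
        deriv = 0.0 if self.prev_error is None else (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * deriv


def simulate(target=1.0, disturbance=-0.5, dt=0.01, steps=3000):
    """Toy plant x' = u + d with a constant, unknown disturbance d."""
    pid = PID(kp=2.0, ki=0.5, kd=0.05, dt=dt)
    observer = EMAFilter(alpha=0.1)
    x, d_hat = 0.0, 0.0
    for _ in range(steps):
        u = pid.step(target - x) - d_hat           # feedback minus estimated disturbance
        x_next = x + dt * (u + disturbance)        # Euler step of the plant
        d_hat = observer.update((x_next - x) / dt - u)  # observed rate minus commanded rate
        x = x_next
    return x, d_hat


final_x, final_d_hat = simulate()
```

Here the raw disturbance estimate is the gap between the observed and commanded rate of change; on a real vehicle that signal is noisy, which is where the low-pass and EMA filtering would matter.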
CV · Nov 2, 2024 · Code
RLE: A Unified Perspective of Data Augmentation for Cross-Spectral Re-identification
Lei Tan, Yukang Zhang, Keke Han et al.
This paper takes a step towards modeling the modality discrepancy in the cross-spectral re-identification task. Based on the Lambertian model, we observe that the non-linear modality discrepancy mainly comes from diverse linear transformations acting on the surface of different materials. From this view, we unify all data augmentation strategies for cross-spectral re-identification by mimicking such local linear transformations and categorizing them into moderate transformation and radical transformation. By extending the observation, we propose a Random Linear Enhancement (RLE) strategy which includes Moderate Random Linear Enhancement (MRLE) and Radical Random Linear Enhancement (RRLE) to push the boundaries of both types of transformation. Moderate Random Linear Enhancement is designed to provide diverse image transformations that satisfy the original linear correlations under constrained conditions, whereas Radical Random Linear Enhancement seeks to generate local linear transformations directly without relying on external information. The experimental results not only demonstrate the superiority and effectiveness of RLE but also confirm its great potential as a general-purpose data augmentation for cross-spectral re-identification. The code is available at https://github.com/stone96123/RLE.
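The two families of transformation can be illustrated with a small sketch. Both functions below are assumptions made for illustration and are not the released RLE code: `random_channel_mix` stands in for a "moderate" transformation by mixing RGB channels with random convex weights, and `random_local_linear` stands in for a "radical" one by scaling a random rectangular patch.

```python
import random


def random_channel_mix(rgb_image, rng):
    """Mix R, G, B with random convex weights (illustrative 'moderate' transform)."""
    raw = [rng.random() + 1e-9 for _ in range(3)]
    total = sum(raw)
    weights = [r / total for r in raw]
    mixed = [[sum(w * c for w, c in zip(weights, px)) for px in row]
             for row in rgb_image]
    return mixed, weights


def random_local_linear(image, rng, scale_range=(0.5, 1.5)):
    """Scale a random rectangular patch by a random factor, clipped to [0, 1]
    (illustrative 'radical' transform). Returns (image, box, scale)."""
    h, w = len(image), len(image[0])
    top, left = rng.randrange(h), rng.randrange(w)
    bottom = rng.randint(top + 1, h)   # exclusive bottom edge
    right = rng.randint(left + 1, w)   # exclusive right edge
    scale = rng.uniform(*scale_range)
    out = [row[:] for row in image]
    for y in range(top, bottom):
        for x in range(left, right):
            out[y][x] = min(1.0, max(0.0, scale * image[y][x]))
    return out, (top, left, bottom, right), scale


rng = random.Random(0)
gray = [[0.1 * (x + y) for x in range(4)] for y in range(4)]
patched, box, scale = random_local_linear(gray, rng)
rgb = [[(0.2, 0.5, 0.9), (0.4, 0.4, 0.4)]]
mixed, weights = random_channel_mix(rgb, rng)
```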
CV · Oct 27, 2025 · Code
MDReID: Modality-Decoupled Learning for Any-to-Any Multi-Modal Object Re-Identification
Yingying Feng, Jie Li, Jie Hu et al.
Real-world object re-identification (ReID) systems often face modality inconsistencies, where query and gallery images come from different sensors (e.g., RGB, NIR, TIR). However, most existing methods assume modality-matched conditions, which limits their robustness and scalability in practical applications. To address this challenge, we propose MDReID, a flexible any-to-any image-level ReID framework designed to operate under both modality-matched and modality-mismatched scenarios. MDReID builds on the insight that modality information can be decomposed into two components: modality-shared features that are predictable and transferable, and modality-specific features that capture unique, modality-dependent characteristics. To effectively leverage this, MDReID introduces two key components: the Modality Decoupling Learning (MDL) and Modality-aware Metric Learning (MML). Specifically, MDL explicitly decomposes modality features into modality-shared and modality-specific representations, enabling effective retrieval in both modality-aligned and mismatched scenarios. MML, a tailored metric learning strategy, further enforces orthogonality and complementarity between the two components to enhance discriminative power across modalities. Extensive experiments conducted on three challenging multi-modality ReID benchmarks (RGBNT201, RGBNT100, MSVR310) consistently demonstrate the superiority of MDReID. Notably, MDReID achieves significant mAP improvements of 9.8%, 3.0%, and 11.5% in general modality-matched scenarios, and average gains of 3.4%, 11.8%, and 10.9% in modality-mismatched scenarios, respectively. The code is available at: https://github.com/stone96123/MDReID.
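One common way to encourage orthogonality between two feature vectors is to penalize their squared cosine similarity. The sketch below shows that generic penalty; it is an assumption about what an orthogonality term could look like and is not the MML loss from the paper, whose exact form the abstract does not give.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def orthogonality_penalty(shared, specific):
    """Squared cosine similarity: 0 when the vectors are orthogonal,
    1 when they are parallel or anti-parallel."""
    return cosine(shared, specific) ** 2


decoupled = orthogonality_penalty([1.0, 0.0], [0.0, 1.0])   # orthogonal features
entangled = orthogonality_penalty([1.0, 2.0], [2.0, 4.0])   # parallel features
```

Minimizing such a term pushes the modality-shared and modality-specific vectors toward carrying non-overlapping information.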
CV · Oct 25, 2025 · Code
GSAlign: Geometric and Semantic Alignment Network for Aerial-Ground Person Re-Identification
Qiao Li, Jie Li, Yukang Zhang et al.
Aerial-Ground person re-identification (AG-ReID) is an emerging yet challenging task that aims to match pedestrian images captured from drastically different viewpoints, typically from unmanned aerial vehicles (UAVs) and ground-based surveillance cameras. The task poses significant challenges due to extreme viewpoint discrepancies, occlusions, and domain gaps between aerial and ground imagery. While prior works have made progress by learning cross-view representations, they remain limited in handling severe pose variations and spatial misalignment. To address these issues, we propose a Geometric and Semantic Alignment Network (GSAlign) tailored for AG-ReID. GSAlign introduces two key components to jointly tackle geometric distortion and semantic misalignment in aerial-ground matching: a Learnable Thin Plate Spline (LTPS) Module and a Dynamic Alignment Module (DAM). The LTPS module adaptively warps pedestrian features based on a set of learned keypoints, effectively compensating for geometric variations caused by extreme viewpoint changes. In parallel, the DAM estimates visibility-aware representation masks that highlight visible body regions at the semantic level, thereby alleviating the negative impact of occlusions and partial observations in cross-view correspondence. A comprehensive evaluation on CARGO with four matching protocols demonstrates the effectiveness of GSAlign, achieving significant improvements of +18.8% in mAP and +16.8% in Rank-1 accuracy over previous state-of-the-art methods on the aerial-ground setting. The code is available at: https://github.com/stone96123/GSAlign.
CV · Jan 4, 2024
Frequency Domain Nuances Mining for Visible-Infrared Person Re-identification
Yukang Zhang, Yang Lu, Yan Yan et al.
The key to visible-infrared person re-identification (VIReID) lies in how to minimize the modality discrepancy between visible and infrared images. Existing methods mainly exploit the spatial information while ignoring the discriminative frequency information. To address this issue, this paper aims to reduce the modality discrepancy from the frequency domain perspective. Specifically, we propose a novel Frequency Domain Nuances Mining (FDNM) method to explore the cross-modality frequency domain information, which mainly includes an amplitude guided phase (AGP) module and an amplitude nuances mining (ANM) module. These two modules are mutually beneficial to jointly explore frequency domain visible-infrared nuances, thereby effectively reducing the modality discrepancy in the frequency domain. Besides, we propose a center-guided nuances mining loss to encourage the ANM module to preserve discriminative identity information while discovering diverse cross-modality nuances. Extensive experiments show that the proposed FDNM has significant advantages in improving the performance of VIReID. Specifically, our method outperforms the second-best method by 5.2% in Rank-1 accuracy and 5.8% in mAP on the SYSU-MM01 dataset under the indoor search mode. Besides, we also validate the effectiveness and generalization of our method on the challenging visible-infrared face recognition task. The code will be available.
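The amplitude and phase components that such frequency-domain methods operate on can be illustrated with a small discrete Fourier transform. The sketch below uses a naive 1-D DFT over a toy row of pixel intensities so that it runs with the standard library alone; the helper names and the amplitude-swap demonstration are illustrative assumptions and do not reproduce the AGP or ANM modules. For real images one would typically use a 2-D FFT routine such as `numpy.fft.fft2`.

```python
import cmath


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real sequence."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning the real part of each reconstructed sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def amplitude_phase(signal):
    """Split a signal's spectrum into per-frequency amplitude and phase."""
    spectrum = dft(signal)
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


def recombine(amplitude, phase):
    """Rebuild a signal from (possibly mixed) amplitude and phase spectra."""
    return idft([cmath.rect(a, p) for a, p in zip(amplitude, phase)])


visible = [0.2, 0.8, 0.5, 0.1]    # toy row of visible-image intensities
infrared = [0.6, 0.4, 0.9, 0.3]   # toy row of infrared-image intensities

amp_vis, phase_vis = amplitude_phase(visible)
amp_ir, phase_ir = amplitude_phase(infrared)

roundtrip = recombine(amp_vis, phase_vis)   # recovers the visible row
hybrid = recombine(amp_ir, phase_vis)       # infrared amplitude, visible phase
```

Swapping only the amplitude keeps the phase spectrum of the visible row while taking on the infrared row's per-frequency energy, which is one simple way to see why the two components are often treated separately.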