CVJul 19, 2024Code
Temporal Correlation Meets Embedding: Towards a 2nd Generation of JDE-based Real-Time Multi-Object TrackingYunfei Zhang, Chao Liang, Jin Gao et al.
Joint Detection and Embedding (JDE) trackers have demonstrated excellent performance in Multi-Object Tracking (MOT) tasks by incorporating the extraction of appearance features as auxiliary tasks through embedding Re-Identification task (ReID) into the detector, achieving a balance between inference speed and tracking performance. However, solving the competition between the detector and the feature extractor has always been a challenge. Meanwhile, the issue of directly embedding the ReID task into MOT has remained unresolved. The lack of high discriminability in appearance features results in their limited utility. In this paper, a new learning approach using cross-correlation to capture temporal information of objects is proposed. The feature extraction network is no longer trained solely on appearance features from each frame but learns richer motion features by utilizing feature heatmaps from consecutive frames, which addresses the challenge of inter-class feature similarity. Furthermore, our learning approach is applied to a more lightweight feature extraction network, and treat the feature matching scores as strong cues rather than auxiliary cues, with an appropriate weight calculation to reflect the compatibility between our obtained features and the MOT task. Our tracker, named TCBTrack, achieves state-of-the-art performance on multiple public benchmarks, i.e., MOT17, MOT20, and DanceTrack datasets. Specifically, on the DanceTrack test set, we achieve 56.8 HOTA, 58.1 IDF1 and 92.5 MOTA, making it the best online tracker capable of achieving real-time performance. Comparative evaluations with other trackers prove that our tracker achieves the best balance between speed, robustness and accuracy. Code is available at https://github.com/yfzhang1214/TCBTrack.
CVSep 14, 2024Code
Associate Everything Detected: Facilitating Tracking-by-Detection to the UnknownZimeng Fang, Chao Liang, Xue Zhou et al.
Multi-object tracking (MOT) emerges as a pivotal and highly promising branch in the field of computer vision. Classical closed-vocabulary MOT (CV-MOT) methods aim to track objects of predefined categories. Recently, some open-vocabulary MOT (OV-MOT) methods have successfully addressed the problem of tracking unknown categories. However, we found that the CV-MOT and OV-MOT methods each struggle to excel in the tasks of the other. In this paper, we present a unified framework, Associate Everything Detected (AED), that simultaneously tackles CV-MOT and OV-MOT by integrating with any off-the-shelf detector and supports unknown categories. Different from existing tracking-by-detection MOT methods, AED gets rid of prior knowledge (e.g. motion cues) and relies solely on highly robust feature learning to handle complex trajectories in OV-MOT tasks while keeping excellent performance in CV-MOT tasks. Specifically, we model the association task as a similarity decoding problem and propose a sim-decoder with an association-centric learning mechanism. The sim-decoder calculates similarities in three aspects: spatial, temporal, and cross-clip. Subsequently, association-centric learning leverages these threefold similarities to ensure that the extracted features are appropriate for continuous tracking and robust enough to generalize to unknown categories. Compared with existing powerful OV-MOT and CV-MOT methods, AED achieves superior performance on TAO, SportsMOT, and DanceTrack without any prior knowledge. Our code is available at https://github.com/balabooooo/AED.
CVMay 19, 2022
Learning Feature Fusion for Unsupervised Domain Adaptive Person Re-identificationJin Ding, Xue Zhou
Unsupervised domain adaptive (UDA) person re-identification (ReID) has gained increasing attention for its effectiveness on the target domain without manual annotations. Most fine-tuning based UDA person ReID methods focus on encoding global features for pseudo labels generation, neglecting the local feature that can provide for the fine-grained information. To handle this issue, we propose a Learning Feature Fusion (LF2) framework for adaptively learning to fuse global and local features to obtain a more comprehensive fusion feature representation. Specifically, we first pre-train our model within a source domain, then fine-tune the model on unlabeled target domain based on the teacher-student training strategy. The average weighting teacher network is designed to encode global features, while the student network updating at each iteration is responsible for fine-grained local features. By fusing these multi-view features, multi-level clustering is adopted to generate diverse pseudo labels. In particular, a learnable Fusion Module (FM) for giving prominence to fine-grained local information within the global feature is also proposed to avoid obscure learning of multiple pseudo labels. Experiments show that our proposed LF2 framework outperforms the state-of-the-art with 73.5% mAP and 83.7% Rank1 on Market1501 to DukeMTMC-ReID, and achieves 83.2% mAP and 92.8% Rank1 on DukeMTMC-ReID to Market1501.
CVNov 13, 2025Code
GEA: Generation-Enhanced Alignment for Text-to-Image Person RetrievalHao Zou, Runqing Zhang, Xue Zhou et al.
Text-to-Image Person Retrieval (TIPR) aims to retrieve person images based on natural language descriptions. Although many TIPR methods have achieved promising results, sometimes textual queries cannot accurately and comprehensively reflect the content of the image, leading to poor cross-modal alignment and overfitting to limited datasets. Moreover, the inherent modality gap between text and image further amplifies these issues, making accurate cross-modal retrieval even more challenging. To address these limitations, we propose the Generation-Enhanced Alignment (GEA) from a generative perspective. GEA contains two parallel modules: (1) Text-Guided Token Enhancement (TGTE), which introduces diffusion-generated images as intermediate semantic representations to bridge the gap between text and visual patterns. These generated images enrich the semantic representation of text and facilitate cross-modal alignment. (2) Generative Intermediate Fusion (GIF), which combines cross-attention between generated images, original images, and text features to generate a unified representation optimized by triplet alignment loss. We conduct extensive experiments on three public TIPR datasets, CUHK-PEDES, RSTPReid, and ICFG-PEDES, to evaluate the performance of GEA. The results justify the effectiveness of our method. More implementation details and extended results are available at https://github.com/sugelamyd123/Sup-for-GEA.
CVMay 13
OP4KSR: One-Step Patch-Free 4K Super-Resolution with Periodic Artifact SuppressionChengyan Deng, Pengbin Yu, Zhentao Chen et al.
Diffusion-based real-world image super-resolution (Real-ISR) has achieved remarkable perceptual quality; however, directly super-resolving images to 4K remains limited by extreme memory consumption. Consequently, prior methods adopt patch-based inference, sacrificing global context and introducing semantic confusion, spatial inconsistency, and severe latency. We propose OP4KSR, a one-step patch-free 4K SR approach built upon the powerful Flux backbone. By leveraging the extreme-compression F16 VAE, OP4KSR makes 4K SR inference tractable under practical GPU budgets, preserving global spatial-semantic coherence while enabling highly efficient inference. However, adapting this one-step architecture intrinsically triggers severe periodic artifacts. We trace this to a RoPE base frequency allocation mismatch and intra-token spatial ambiguity, both exacerbated by the lack of iterative refinement. To suppress these artifacts, we couple RoPE base frequency rescaling (RFR) with an autocorrelation-based periodicity loss ($\mathcal{L}_\text{AP}$). Furthermore, we curate a dedicated training dataset alongside three benchmarks (one synthetic and two real-world) to advance 4K SR research. Extensive experiments demonstrate that OP4KSR achieves competitive perceptual quality with efficient inference, generating a $4096\times4096$ output in only 5.75 seconds on a single NVIDIA H20 GPU.
CVApr 19, 2021Code
One More Check: Making "Fake Background" Be Tracked AgainChao Liang, Zhipeng Zhang, Xue Zhou et al.
The one-shot multi-object tracking, which integrates object detection and ID embedding extraction into a unified network, has achieved groundbreaking results in recent years. However, current one-shot trackers solely rely on single-frame detections to predict candidate bounding boxes, which may be unreliable when facing disastrous visual degradation, e.g., motion blur, occlusions. Once a target bounding box is mistakenly classified as background by the detector, the temporal consistency of its corresponding tracklet will be no longer maintained. In this paper, we set out to restore the bounding boxes misclassified as ``fake background'' by proposing a re-check network. The re-check network innovatively expands the role of ID embedding from data association to motion forecasting by effectively propagating previous tracklets to the current frame with a small overhead. Note that the propagation results are yielded by an independent and efficient embedding search, preventing the model from over-relying on detection results. Eventually, it helps to reload the ``fake background'' and repair the broken tracklets. Building on a strong baseline CSTrack, we construct a new one-shot tracker and achieve favorable gains by 70.7 $\rightarrow$ 76.4, 70.6 $\rightarrow$ 76.3 MOTA on MOT16 and MOT17, respectively. It also reaches a new state-of-the-art MOTA and IDF1 performance. Code is released at https://github.com/JudasDie/SOTS.
CVOct 23, 2020Code
Rethinking the competition between detection and ReID in Multi-Object TrackingChao Liang, Zhipeng Zhang, Xue Zhou et al.
Due to balanced accuracy and speed, one-shot models which jointly learn detection and identification embeddings, have drawn great attention in multi-object tracking (MOT). However, the inherent differences and relations between detection and re-identification (ReID) are unconsciously overlooked because of treating them as two isolated tasks in the one-shot tracking paradigm. This leads to inferior performance compared with existing two-stage methods. In this paper, we first dissect the reasoning process for these two tasks, which reveals that the competition between them inevitably would destroy task-dependent representations learning. To tackle this problem, we propose a novel reciprocal network (REN) with a self-relation and cross-relation design so that to impel each branch to better learn task-dependent representations. The proposed model aims to alleviate the deleterious tasks competition, meanwhile improve the cooperation between detection and ReID. Furthermore, we introduce a scale-aware attention network (SAAN) that prevents semantic level misalignment to improve the association capability of ID embeddings. By integrating the two delicately designed networks into a one-shot online MOT system, we construct a strong MOT tracker, namely CSTrack. Our tracker achieves the state-of-the-art performance on MOT16, MOT17 and MOT20 datasets, without other bells and whistles. Moreover, CSTrack is efficient and runs at 16.4 FPS on a single modern GPU, and its lightweight version even runs at 34.6 FPS. The complete code has been released at https://github.com/JudasDie/SOTS.
CVSep 5, 2019Code
Adaptive Graph Representation Learning for Video Person Re-identificationYiming Wu, Omar El Farouk Bourahla, Xi Li et al.
Recent years have witnessed the remarkable progress of applying deep learning models in video person re-identification (Re-ID). A key factor for video person Re-ID is to effectively construct discriminative and robust video feature representations for many complicated situations. Part-based approaches employ spatial and temporal attention to extract representative local features. While correlations between parts are ignored in the previous methods, to leverage the relations of different parts, we propose an innovative adaptive graph representation learning scheme for video person Re-ID, which enables the contextual interactions between relevant regional features. Specifically, we exploit the pose alignment connection and the feature affinity connection to construct an adaptive structure-aware adjacency graph, which models the intrinsic relations between graph nodes. We perform feature propagation on the adjacency graph to refine regional features iteratively, and the neighbor nodes' information is taken into account for part feature representation. To learn compact and discriminative representations, we further propose a novel temporal resolution-aware regularization, which enforces the consistency among different temporal resolutions for the same identities. We conduct extensive evaluations on four benchmarks, i.e. iLIDS-VID, PRID2011, MARS, and DukeMTMC-VideoReID, experimental results achieve the competitive performance which demonstrates the effectiveness of our proposed method. The code is available at https://github.com/weleen/AGRL.pytorch.
CVSep 10, 2024
AMNS: Attention-Weighted Selective Mask and Noise Label Suppression for Text-to-Image Person RetrievalRunqing Zhang, Xue Zhou
Most existing text-to-image person retrieval methods usually assume that the training image-text pairs are perfectly aligned; however, the noisy correspondence(NC) issue (i.e., incorrect or unreliable alignment) exists due to poor image quality and labeling errors. Additionally, random masking augmentation may inadvertently discard critical semantic content, introducing noisy matches between images and text descriptions. To address the above two challenges, we propose a noise label suppression method to mitigate NC and an Attention-Weighted Selective Mask (AWM) strategy to resolve the issues caused by random masking. Specifically, the Bidirectional Similarity Distribution Matching (BSDM) loss enables the model to effectively learn from positive pairs while preventing it from over-relying on them, thereby mitigating the risk of overfitting to noisy labels. In conjunction with this, Weight Adjustment Focal (WAF) loss improves the model's ability to handle hard samples. Furthermore, AWM processes raw images through an EMA version of the image encoder, selectively retaining tokens with strong semantic connections to the text, enabling better feature extraction. Extensive experiments demonstrate the effectiveness of our approach in addressing noise-related issues and improving retrieval performance.
CVApr 27, 2024
Unsupervised Anomaly Detection via Masked Diffusion Posterior SamplingDi Wu, Shicai Fan, Xue Zhou et al.
Reconstruction-based methods have been commonly used for unsupervised anomaly detection, in which a normal image is reconstructed and compared with the given test image to detect and locate anomalies. Recently, diffusion models have shown promising applications for anomaly detection due to their powerful generative ability. However, these models lack strict mathematical support for normal image reconstruction and unexpectedly suffer from low reconstruction quality. To address these issues, this paper proposes a novel and highly-interpretable method named Masked Diffusion Posterior Sampling (MDPS). In MDPS, the problem of normal image reconstruction is mathematically modeled as multiple diffusion posterior sampling for normal images based on the devised masked noisy observation model and the diffusion-based normal image prior under Bayesian framework. Using a metric designed from pixel-level and perceptual-level perspectives, MDPS can effectively compute the difference map between each normal posterior sample and the given test image. Anomaly scores are obtained by averaging all difference maps for multiple posterior samples. Exhaustive experiments on MVTec and BTAD datasets demonstrate that MDPS can achieve state-of-the-art performance in normal image reconstruction quality as well as anomaly detection and localization.
LGJun 12, 2025
Collapsing Sequence-Level Data-Policy Coverage via Poisoning Attack in Offline Reinforcement LearningXue Zhou, Dapeng Man, Chen Xu et al.
Offline reinforcement learning (RL) heavily relies on the coverage of pre-collected data over the target policy's distribution. Existing studies aim to improve data-policy coverage to mitigate distributional shifts, but overlook security risks from insufficient coverage, and the single-step analysis is not consistent with the multi-step decision-making nature of offline RL. To address this, we introduce the sequence-level concentrability coefficient to quantify coverage, and reveal its exponential amplification on the upper bound of estimation errors through theoretical analysis. Building on this, we propose the Collapsing Sequence-Level Data-Policy Coverage (CSDPC) poisoning attack. Considering the continuous nature of offline RL data, we convert state-action pairs into decision units, and extract representative decision patterns that capture multi-step behavior. We identify rare patterns likely to cause insufficient coverage, and poison them to reduce coverage and exacerbate distributional shifts. Experiments show that poisoning just 1% of the dataset can degrade agent performance by 90%. This finding provides new perspectives for analyzing and safeguarding the security of offline RL.