Shibo He

LG
h-index: 50
37 papers
1,034 citations
Novelty: 55%
AI Score: 58

37 Papers

CV | Oct 29, 2023 | Code
AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly Detection

Qihang Zhou, Guansong Pang, Yu Tian et al.

Zero-shot anomaly detection (ZSAD) requires detection models trained using auxiliary data to detect anomalies without any training sample in a target dataset. It is a crucial task when training data is not accessible due to various concerns, e.g., data privacy, yet it is challenging since the models need to generalize to anomalies across different domains where the appearance of foreground objects, abnormal regions, and background features, such as defects/tumors on different products/organs, can vary significantly. Recently, large pre-trained vision-language models (VLMs), such as CLIP, have demonstrated strong zero-shot recognition ability in various vision tasks, including anomaly detection. However, their ZSAD performance is weak since the VLMs focus more on modeling the class semantics of the foreground objects rather than the abnormality/normality in the images. In this paper, we introduce a novel approach, namely AnomalyCLIP, to adapt CLIP for accurate ZSAD across different domains. The key insight of AnomalyCLIP is to learn object-agnostic text prompts that capture generic normality and abnormality in an image regardless of its foreground objects. This allows our model to focus on the abnormal image regions rather than the object semantics, enabling generalized normality and abnormality recognition on diverse types of objects. Large-scale experiments on 17 real-world anomaly detection datasets show that AnomalyCLIP achieves superior zero-shot performance of detecting and segmenting anomalies in datasets of highly diverse class semantics from various defect inspection and medical imaging domains. Code will be made available at https://github.com/zqhang/AnomalyCLIP.

LG | Aug 3, 2022 | Code
Detecting Multivariate Time Series Anomalies with Zero Known Label

Qihang Zhou, Jiming Chen, Haoyu Liu et al.

Multivariate time series anomaly detection has been extensively studied under the semi-supervised setting, where a training dataset with all normal instances is required. However, preparing such a dataset is very laborious since each single data instance should be fully guaranteed to be normal. It is, therefore, desired to explore multivariate time series anomaly detection methods based on the dataset without any label knowledge. In this paper, we propose MTGFlow, an unsupervised anomaly detection approach for multivariate time series anomaly detection via dynamic graph and entity-aware normalizing flow, relying only on a widely accepted hypothesis that abnormal instances exhibit sparser densities than normal ones. However, the complex interdependencies among entities and the diverse inherent characteristics of each entity pose significant challenges to density estimation, let alone to detecting anomalies based on the estimated probability distribution. To tackle these problems, we propose to learn the mutual and dynamic relations among entities via a graph structure learning model, which helps to model the distribution of multivariate time series accurately. Moreover, taking into account the distinct characteristics of the individual entities, an entity-aware normalizing flow is developed to map each entity to a parameterized normal distribution, thereby producing fine-grained density estimation. Incorporating these two strategies, MTGFlow achieves superior anomaly detection performance. In experiments on five public datasets against seven baselines, MTGFlow outperforms the SOTA methods by up to 5.0% AUROC. Code will be released at https://github.com/zqhang/Detecting-Multivariate-Time-Series-Anomalies-with-Zero-Known-Label.

CR | Dec 16, 2025 | Code
VICTOR: Dataset Copyright Auditing in Video Recognition Systems

Quan Yuan, Zhikun Zhang, Linkang Du et al.

Video recognition systems are increasingly being deployed in daily life, such as content recommendation and security monitoring. To enhance video recognition development, many institutions have released high-quality public datasets with open-source licenses for training advanced models. At the same time, these datasets are also susceptible to misuse and infringement. Dataset copyright auditing is an effective solution to identify such unauthorized use. However, existing dataset copyright solutions primarily focus on the image domain; the complex nature of video data leaves dataset copyright auditing in the video domain unexplored. Specifically, video data introduces an additional temporal dimension, which poses significant challenges to the effectiveness and stealthiness of existing methods. In this paper, we propose VICTOR, the first dataset copyright auditing approach for video recognition systems. We develop a general and stealthy sample modification strategy that enhances the output discrepancy of the target model. By modifying only a small proportion of samples (e.g., 1%), VICTOR amplifies the impact of published modified samples on the prediction behavior of the target models. Then, the difference in the model's behavior for published modified and unpublished original samples can serve as a key basis for dataset auditing. Extensive experiments on multiple models and datasets highlight the superiority of VICTOR. Finally, we show that VICTOR is robust in the presence of several perturbation mechanisms to the training videos or the target models.

CV | Jul 22, 2022
NeurAR: Neural Uncertainty for Autonomous 3D Reconstruction with Implicit Neural Representations

Yunlong Ran, Jing Zeng, Shibo He et al.

Implicit neural representations have shown compelling results in offline 3D reconstruction and also recently demonstrated the potential for online SLAM systems. However, applying them to autonomous 3D reconstruction, where a robot is required to explore a scene and plan a view path for the reconstruction, has not been studied. In this paper, we explore for the first time the possibility of using implicit neural representations for autonomous 3D scene reconstruction by addressing two key challenges: 1) seeking a criterion to measure the quality of the candidate viewpoints for the view planning based on the new representations, and 2) learning the criterion from data that can generalize to different scenes instead of hand-crafting one. To solve the challenges, firstly, a proxy of Peak Signal-to-Noise Ratio (PSNR) is proposed to quantify viewpoint quality; secondly, the proxy is optimized jointly with the parameters of an implicit neural network for the scene. With the proposed view quality criterion from neural networks (termed as Neural Uncertainty), we can then apply implicit representations to autonomous 3D reconstruction. Our method demonstrates significant improvements on various metrics for the rendered image quality and the geometry quality of the reconstructed 3D models when compared with variants using TSDF or reconstruction without view planning. Project webpage: https://kingteeloki-ran.github.io/NeurAR/

LG | May 27, 2022
AsyncFedED: Asynchronous Federated Learning with Euclidean Distance based Adaptive Weight Aggregation

Qiyuan Wang, Qianqian Yang, Shibo He et al.

In an asynchronous federated learning framework, the server updates the global model once it receives an update from a client instead of waiting for all the updates to arrive as in the synchronous setting. This allows heterogeneous devices with varied computing power to train the local models without pausing, thereby speeding up the training process. However, it introduces the stale model problem, where the newly arrived update was calculated based on a set of stale weights that are older than the current global model, which may hurt the convergence of the model. In this paper, we present an asynchronous federated learning framework with a proposed adaptive weight aggregation algorithm, referred to as AsyncFedED. To the best of our knowledge this aggregation method is the first to take the staleness of the arrived gradients, measured by the Euclidean distance between the stale model and the current global model, and the number of local epochs that have been performed, into account. Assuming general non-convex loss functions, we prove the convergence of the proposed method theoretically. Numerical results validate the effectiveness of the proposed AsyncFedED in terms of the convergence rate and model accuracy compared to the existing methods for three considered tasks.
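The staleness idea can be illustrated with a minimal sketch. Only the use of the Euclidean distance between the stale and current global models comes from the abstract; the weighting rule `base_lr / (1 + staleness)` and the names `staleness` and `aggregate` are assumptions made for illustration, not the AsyncFedED update rule, and the number of local epochs is not modeled here.

```python
import numpy as np

def staleness(global_now, global_then):
    """Euclidean distance between the current global model and the
    (stale) global model the client started from."""
    return float(np.linalg.norm(global_now - global_then))

def aggregate(global_now, global_then, client_update, base_lr=1.0):
    """Apply a client update with a weight that shrinks as staleness grows.

    The rule base_lr / (1 + staleness) is a hypothetical choice for this sketch.
    """
    weight = base_lr / (1.0 + staleness(global_now, global_then))
    return global_now + weight * client_update, weight

# A client that trained on an old copy of the model gets a smaller weight.
now = np.array([3.0, 4.0])
then = np.array([0.0, 0.0])
new_model, w = aggregate(now, then, np.array([6.0, 6.0]))
```

A fresh update (stale copy equal to the current model) keeps the full `base_lr`, while the stale update above is scaled down by a factor of six.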

SP | Jul 25, 2024
GesturePrint: Enabling User Identification for mmWave-based Gesture Recognition Systems

Lilin Xu, Keyi Wang, Chaojie Gu et al.

The millimeter-wave (mmWave) radar has been exploited for gesture recognition. However, existing mmWave-based gesture recognition methods cannot identify different users, which is important for ubiquitous gesture interaction in many applications. In this paper, we propose GesturePrint, which is the first to achieve gesture recognition and gesture-based user identification using a commodity mmWave radar sensor. GesturePrint features an effective pipeline that enables the gesture recognition system to identify users at a minor additional cost. By introducing an efficient signal preprocessing stage and a network architecture GesIDNet, which employs an attention-based multilevel feature fusion mechanism, GesturePrint effectively extracts unique gesture features for gesture recognition and personalized motion pattern features for user identification. We implement GesturePrint and collect data from 17 participants performing 15 gestures in a meeting room and an office, respectively. GesturePrint achieves a gesture recognition accuracy (GRA) of 98.87% with a user identification accuracy (UIA) of 99.78% in the meeting room, and 98.22% GRA with 99.26% UIA in the office. Extensive experiments on three public datasets and a new gesture dataset show GesturePrint's superior performance in enabling effective user identification for gesture recognition systems.

LG | Nov 22, 2023
Confidant: Customizing Transformer-based LLMs via Collaborative Edge Training

Yuhao Chen, Yuxuan Yan, Qianqian Yang et al.

Transformer-based large language models (LLMs) have demonstrated impressive capabilities in a variety of natural language processing (NLP) tasks. Nonetheless, it is challenging to deploy and fine-tune LLMs on mobile edge devices with limited computing, memory, and energy budgets. In this paper, we propose Confidant, a multi-backend collaborative training framework for customizing state-of-the-art LLMs on commodity mobile devices like smartphones. Confidant partitions an LLM into several sub-models so that each fits into a mobile device's memory. A pipeline parallel training mechanism is further developed to ensure fast and efficient distributed training. In addition, we propose a novel backend scheduler to allocate different attention heads to heterogeneous compute hardware, including mobile CPUs and GPUs, to maximize the compute resource utilization on each edge device. Our preliminary experimental results show that Confidant achieves up to 45.3% memory reduction and 8.03x inference speedup in practical settings.

LG | Nov 10, 2023
AccEPT: An Acceleration Scheme for Speeding Up Edge Pipeline-parallel Training

Yuhao Chen, Yuxuan Yan, Qianqian Yang et al.

It is usually infeasible to fit and train an entire large deep neural network (DNN) model using a single edge device due to the limited resources. To facilitate intelligent applications across edge devices, researchers have proposed partitioning a large model into several sub-models, and deploying each of them to a different edge device to collaboratively train a DNN model. However, the communication overhead caused by the large amount of data transmitted from one device to another during training, as well as the sub-optimal partition point due to the inaccurate latency prediction of computation at each edge device, can significantly slow down training. In this paper, we propose AccEPT, an acceleration scheme for edge collaborative pipeline-parallel training. In particular, we propose a light-weight adaptive latency predictor to accurately estimate the computation latency of each layer at different devices, which also adapts to unseen devices through continuous learning. The proposed latency predictor therefore leads to better model partitioning, which balances the computation loads across participating devices. Moreover, we propose a bit-level computation-efficient data compression scheme to compress the data to be transmitted between devices during training. Our numerical results demonstrate that the proposed acceleration approach speeds up edge pipeline-parallel training by up to 3 times in the considered experimental settings.
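To make the link between latency prediction and partitioning concrete, here is a minimal sketch for two devices. It assumes per-layer latency predictions are already available as plain lists and ignores communication cost; the function name `best_split` and the brute-force search are illustrative assumptions, not the partitioning procedure used in AccEPT.

```python
def best_split(latency_a, latency_b):
    """Choose a split index k so that device A runs layers [0, k) and
    device B runs layers [k, n), minimising the slower pipeline stage.

    latency_a[i] / latency_b[i] are the (hypothetical) predicted latencies
    of layer i on device A / device B.
    """
    n = len(latency_a)
    best_k, best_cost = 0, float("inf")
    for k in range(n + 1):
        stage_a = sum(latency_a[:k])   # time spent on device A
        stage_b = sum(latency_b[k:])   # time spent on device B
        cost = max(stage_a, stage_b)   # the slower stage bounds throughput
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k, best_cost

# Device B is twice as slow per layer, so it should receive fewer layers.
split, bottleneck = best_split([1, 1, 1, 1], [2, 2, 2, 2])
```

With these numbers the faster device takes three of the four layers; an inaccurate predictor would shift the split and lengthen the bottleneck stage.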

DC | Jul 21, 2023
Robust Fully-Asynchronous Methods for Distributed Training over General Architecture

Zehan Zhu, Ye Tian, Yan Huang et al.

Perfect synchronization in distributed machine learning problems is inefficient and even impossible due to the existence of latency, packet losses and stragglers. We propose a Robust Fully-Asynchronous Stochastic Gradient Tracking method (R-FAST), where each device performs local computation and communication at its own pace without any form of synchronization. Different from existing asynchronous distributed algorithms, R-FAST can eliminate the impact of data heterogeneity across devices and allow for packet losses by employing a robust gradient tracking strategy that relies on properly designed auxiliary variables for tracking and buffering the overall gradient vector. More importantly, the proposed method utilizes two spanning-tree graphs for communication so long as both share at least one common root, enabling flexible designs in communication architectures. We show that R-FAST converges in expectation to a neighborhood of the optimum with a geometric rate for smooth and strongly convex objectives; and to a stationary point with a sublinear rate for general non-convex settings. Extensive experiments demonstrate that R-FAST runs 1.5-2 times faster than synchronous benchmark algorithms, such as Ring-AllReduce and D-PSGD, while still achieving comparable accuracy, and outperforms existing asynchronous SOTA algorithms, such as AD-PSGD and OSGP, especially in the presence of stragglers.

CV | Nov 18, 2022
$α$ DARTS Once More: Enhancing Differentiable Architecture Search by Masked Image Modeling

Bicheng Guo, Shuxuan Guo, Miaojing Shi et al.

Differentiable architecture search (DARTS) has been a mainstream direction in automatic machine learning. Since the discovery that original DARTS will inevitably converge to poor architectures, recent works alleviate this by either designing rule-based architecture selection techniques or incorporating complex regularization techniques, abandoning the simplicity of the original DARTS that selects architectures based on the largest parametric value, namely $α$. Moreover, we find that all the previous attempts only rely on classification labels, hence learning only single modal information and limiting the representation power of the shared network. To this end, we propose to additionally inject semantic information by formulating a patch recovery approach. Specifically, we exploit the recent trending masked image modeling and do not abandon the guidance from the downstream tasks during the search phase. Our method surpasses all previous DARTS variants and achieves state-of-the-art results on CIFAR-10, CIFAR-100, and ImageNet without complex manual-designed strategies.

LG | Sep 26, 2024
Dataset Distillation-based Hybrid Federated Learning on Non-IID Data

Xiufang Shi, Wei Zhang, Mincheng Wu et al.

With the development of edge computing, Federated Learning (FL) has emerged as a promising solution for the intelligent Internet of Things (IoT). However, applying FL in mobile edge-cloud networks is greatly challenged by statistical heterogeneity and high communication overhead. To address these challenges, we propose a hybrid federated learning framework called HFLDD, which integrates dataset distillation to generate approximately independent and identically distributed (IID) data, thereby improving the performance of model training. In particular, we partition the clients into heterogeneous clusters, where the data labels among different clients within a cluster are unbalanced while the data labels among different clusters are balanced. The cluster heads collect distilled data from the corresponding cluster members, and conduct model training in collaboration with the server. This training process is like traditional federated learning on IID data, and hence effectively alleviates the impact of non-IID data on model training. We perform a comprehensive analysis of the convergence behavior, communication overhead, and computational complexity of the proposed HFLDD. Extensive experimental results based on multiple public datasets demonstrate that when data labels are severely imbalanced, the proposed HFLDD outperforms the baseline methods in terms of both test accuracy and communication cost.

CV | Apr 1 | Code
PC-SAM: Patch-Constrained Fine-Grained Interactive Road Segmentation in High-Resolution Remote Sensing Images

Chengcheng Lv, Rushi Li, Mincheng Wu et al.

Road masks obtained from remote sensing images effectively support a wide range of downstream tasks. In recent years, most studies have focused on improving the performance of fully automatic segmentation models for this task, achieving significant gains. However, current fully automatic methods are still insufficient for identifying certain challenging road segments and often produce false positive and false negative regions. Moreover, fully automatic segmentation does not support local segmentation of regions of interest or refinement of existing masks. Although the SAM model is widely used as an interactive segmentation model and performs well on natural images, it shows poor performance in remote sensing road segmentation and cannot support fine-grained local refinement. To address these limitations, we propose PC-SAM, which integrates fully automatic road segmentation and interactive segmentation within a unified framework. By carefully designing a fine-tuning strategy, the influence of point prompts is constrained to their corresponding patches, overcoming the inability of the original SAM to perform fine local corrections and enabling fine-grained interactive mask refinement. Extensive experiments on several representative remote sensing road segmentation datasets demonstrate that, when combined with point prompts, PC-SAM significantly outperforms state-of-the-art fully automatic models in road mask segmentation, while also providing flexible local mask refinement and local road segmentation. The code will be available at https://github.com/Cyber-CCOrange/PC-SAM.

SP | Jan 15
Sim2Real Deep Transfer for Per-Device CFO Calibration

Jingze Zheng, Zhiguo Shi, Shibo He et al.

Carrier Frequency Offset (CFO) estimation in Orthogonal Frequency Division Multiplexing (OFDM) systems faces significant performance degradation across heterogeneous software-defined radio (SDR) platforms due to uncalibrated hardware impairments. Existing deep neural network (DNN)-based approaches lack device-level adaptation, limiting their practical deployment. This paper proposes a Sim2Real transfer learning framework for per-device CFO calibration, combining simulation-driven pretraining with lightweight receiver adaptation. A backbone DNN is pre-trained on synthetic OFDM signals incorporating parametric hardware distortions (e.g., phase noise, IQ imbalance), enabling generalized feature learning without costly cross-device data collection. Subsequently, only the regression layers are fine-tuned using $1,000$ real frames per target device, preserving hardware-agnostic knowledge while adapting to device-specific impairments. Experiments across three SDR families (USRP B210, USRP N210, HackRF One) achieve $30\times$ BER reduction compared to conventional CP-based methods under indoor multipath conditions. The framework bridges the simulation-to-reality gap for robust CFO estimation, enabling cost-effective deployment in heterogeneous wireless systems.
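For context, the conventional CP-based baseline mentioned above can be sketched in a few lines: the cyclic prefix repeats the tail of the OFDM symbol, so the phase of their correlation reveals the offset. This is a noise-free illustration of the textbook correlation estimator, not the paper's DNN; the FFT size, prefix length and offset value are arbitrary choices for the example.

```python
import numpy as np

def cp_cfo_estimate(rx, n_fft, n_cp):
    """Estimate the CFO (in units of subcarrier spacing) from one OFDM
    symbol by correlating the cyclic prefix with the symbol tail it copies."""
    head = rx[:n_cp]                    # received cyclic prefix
    tail = rx[n_fft:n_fft + n_cp]       # the samples the prefix was copied from
    corr = np.sum(np.conj(head) * tail)
    return np.angle(corr) / (2 * np.pi)

# Build one QPSK-loaded OFDM symbol with a cyclic prefix (example parameters).
rng = np.random.default_rng(0)
n_fft, n_cp, true_cfo = 64, 16, 0.1
freq = rng.choice([-1, 1], n_fft) + 1j * rng.choice([-1, 1], n_fft)
sym = np.fft.ifft(freq)
tx = np.concatenate([sym[-n_cp:], sym])

# Apply the frequency offset as a progressive phase rotation.
n = np.arange(tx.size)
rx = tx * np.exp(2j * np.pi * true_cfo * n / n_fft)

est = cp_cfo_estimate(rx, n_fft, n_cp)
```

The estimator is only unambiguous for offsets within half a subcarrier spacing, and it does not model hardware impairments such as phase noise or IQ imbalance.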

AI | Dec 11, 2025
Reverse Thinking Enhances Missing Information Detection in Large Language Models

Yuxin Liu, Chaojie Gu, Yihang Zhang et al.

Large Language Models (LLMs) have demonstrated remarkable capabilities in various reasoning tasks, yet they often struggle with problems involving missing information, exhibiting issues such as incomplete responses, factual errors, and hallucinations. While forward reasoning approaches like Chain-of-Thought (CoT) and Tree-of-Thought (ToT) have shown success in structured problem-solving, they frequently fail to systematically identify and recover omitted information. In this paper, we explore the potential of reverse thinking methodologies to enhance LLMs' performance on missing information detection tasks. Drawing inspiration from recent work on backward reasoning, we propose a novel framework that guides LLMs through reverse thinking to identify necessary conditions and pinpoint missing elements. Our approach transforms the challenging task of missing information identification into a more manageable backward reasoning problem, significantly improving model accuracy. Experimental results demonstrate that our reverse thinking approach achieves substantial performance gains compared to traditional forward reasoning methods, providing a promising direction for enhancing LLMs' logical completeness and reasoning robustness.

CV | Nov 29, 2024 | Code
FairDD: Fair Dataset Distillation

Qihang Zhou, Shenhao Fang, Shibo He et al.

Condensing large datasets into smaller synthetic counterparts has demonstrated its promise for image classification. However, previous research has overlooked a crucial concern in image recognition: ensuring that models trained on condensed datasets are unbiased towards protected attributes (PA), such as gender and race. Our investigation reveals that dataset distillation fails to alleviate the unfairness towards minority groups within original datasets. Moreover, this bias typically worsens in the condensed datasets due to their smaller size. To bridge the research gap, we propose a novel fair dataset distillation (FDD) framework, namely FairDD, which can be seamlessly applied to diverse matching-based DD approaches (DDs), requiring no modifications to their original architectures. The key innovation of FairDD lies in synchronously matching synthetic datasets to PA-wise groups of original datasets, rather than indiscriminate alignment to the whole distributions in vanilla DDs, dominated by majority groups. This synchronized matching allows synthetic datasets to avoid collapsing into majority groups and bootstrap their balanced generation to all PA groups. Consequently, FairDD could effectively regularize vanilla DDs to favor biased generation toward minority groups while maintaining the accuracy of target attributes. Theoretical analyses and extensive experimental evaluations demonstrate that FairDD significantly improves fairness compared to vanilla DDs, with a promising trade-off between fairness and accuracy. Its consistent superiority across diverse DDs, spanning Distribution and Gradient Matching, establishes it as a versatile FDD approach. Code is available at https://github.com/zqhang/FairDD.
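The contrast between whole-distribution matching and group-wise matching can be shown with a toy mean-matching example. The sketch below assumes the simplest possible matching objective (squared distance between feature means), under which the best synthetic mean is the overall mean in the vanilla case and the average of the group means in the group-wise case. The function names and data are hypothetical, and this is not the FairDD loss itself.

```python
import numpy as np

def vanilla_target(features):
    """Mean that a whole-distribution matching objective pulls the
    synthetic data towards; it is dominated by the majority group."""
    return features.mean(axis=0)

def groupwise_target(features, groups):
    """Minimiser of sum_g ||mean_g - m||^2: every protected-attribute
    group contributes equally, regardless of its size."""
    means = [features[groups == g].mean(axis=0) for g in np.unique(groups)]
    return np.mean(means, axis=0)

# 90 majority samples at 0 and 10 minority samples at 10 (toy 1-D features).
features = np.concatenate([np.zeros((90, 1)), np.full((10, 1), 10.0)])
groups = np.array([0] * 90 + [1] * 10)

v = vanilla_target(features)            # pulled towards the majority
f = groupwise_target(features, groups)  # sits halfway between the groups
```

In this toy case the vanilla target lands at 1.0 while the group-wise target lands at 5.0, which illustrates why matching per group keeps the synthetic set from collapsing into the majority.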

LG | Jan 20, 2024 | Code
TreeMIL: A Multi-instance Learning Framework for Time Series Anomaly Detection with Inexact Supervision

Chen Liu, Shibo He, Haoyu Liu et al.

Time series anomaly detection (TSAD) plays a vital role in various domains such as healthcare, networks, and industry. Considering labels are crucial for detection but difficult to obtain, we turn to TSAD with inexact supervision: only series-level labels are provided during the training phase, while point-level anomalies are predicted during the testing phase. Previous works follow a traditional multi-instance learning (MIL) approach, which focuses on encouraging high anomaly scores at individual time steps. However, time series anomalies are not limited to individual point anomalies; they can also be collective anomalies, typically exhibiting abnormal patterns over subsequences. To address the challenge of collective anomalies, in this paper, we propose a tree-based MIL framework (TreeMIL). We first adopt an N-ary tree structure to divide the entire series into multiple nodes, where nodes at different levels represent subsequences with different lengths. Then, the subsequence features are extracted to determine the presence of collective anomalies. Finally, we calculate point-level anomaly scores by aggregating features from nodes at different levels. Experiments conducted on seven public datasets against eight baselines demonstrate that TreeMIL achieves an average 32.3% improvement in F1-score compared to previous state-of-the-art methods. The code is available at https://github.com/fly-orange/TreeMIL.
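The tree decomposition and multi-level aggregation can be sketched as follows. The N-ary split into spans follows the description above; the node score (mean absolute deviation from the series mean) and the averaging across levels are stand-ins chosen for this example, since the abstract does not specify the learned features or the aggregation rule.

```python
def build_levels(length, n_ary, depth):
    """Return one list of (start, end) spans per tree level.
    Level d splits the series into n_ary**d contiguous nodes."""
    levels = []
    for d in range(depth + 1):
        parts = n_ary ** d
        bounds = [i * length // parts for i in range(parts + 1)]
        levels.append([(bounds[i], bounds[i + 1]) for i in range(parts)])
    return levels

def point_scores(series, n_ary=2, depth=2):
    """Average, for each time step, the scores of the nodes that cover it."""
    mean = sum(series) / len(series)
    levels = build_levels(len(series), n_ary, depth)
    scores = [0.0] * len(series)
    for spans in levels:
        for start, end in spans:
            if end <= start:
                continue  # skip empty nodes when the series is short
            # Hypothetical node score: mean absolute deviation in the span.
            node = sum(abs(x - mean) for x in series[start:end]) / (end - start)
            for t in range(start, end):
                scores[t] += node / len(levels)
    return scores

# A short collective anomaly at the end of an otherwise flat series.
series = [0, 0, 0, 0, 0, 0, 4, 4]
scores = point_scores(series)
```

Points inside the anomalous subsequence inherit high scores from every level that covers them, which is how subsequence-level evidence reaches individual time steps.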

LG | Nov 12, 2025
Hierarchical Schedule Optimization for Fast and Robust Diffusion Model Sampling

Aihua Zhu, Rui Su, Qinglin Zhao et al.

Diffusion probabilistic models have set a new standard for generative fidelity but are hindered by a slow iterative sampling process. A powerful training-free strategy to accelerate this process is Schedule Optimization, which aims to find an optimal distribution of timesteps for a fixed and small Number of Function Evaluations (NFE) to maximize sample quality. To this end, a successful schedule optimization method must adhere to four core principles: effectiveness, adaptivity, practical robustness, and computational efficiency. However, existing paradigms struggle to satisfy these principles simultaneously, motivating the need for a more advanced solution. To overcome these limitations, we propose the Hierarchical-Schedule-Optimizer (HSO), a novel and efficient bi-level optimization framework. HSO reframes the search for a globally optimal schedule into a more tractable problem by iteratively alternating between two synergistic levels: an upper-level global search for an optimal initialization strategy and a lower-level local optimization for schedule refinement. This process is guided by two key innovations: the Midpoint Error Proxy (MEP), a solver-agnostic and numerically stable objective for effective local optimization, and the Spacing-Penalized Fitness (SPF) function, which ensures practical robustness by penalizing pathologically close timesteps. Extensive experiments show that HSO sets a new state-of-the-art for training-free sampling in the extremely low-NFE regime. For instance, with an NFE of just 5, HSO achieves a remarkable FID of 11.94 on LAION-Aesthetics with Stable Diffusion v2.1. Crucially, this level of performance is attained not through costly retraining, but with a one-time optimization cost of less than 8 seconds, presenting a highly practical and efficient paradigm for diffusion model acceleration.

LG | Dec 17, 2023
Label-Free Multivariate Time Series Anomaly Detection

Qihang Zhou, Shibo He, Haoyu Liu et al.

Anomaly detection in multivariate time series (MTS) has been widely studied in the one-class classification (OCC) setting. The training samples in OCC are assumed to be normal, which is difficult to guarantee in practical situations. Such a case may degrade the performance of OCC-based anomaly detection methods which fit the training distribution as the normal distribution. In this paper, we propose MTGFlow, an unsupervised anomaly detection approach for MTS anomaly detection via dynamic Graph and entity-aware normalizing Flow. MTGFlow first estimates the density of the entire training samples and then identifies anomalous instances based on the density of the test samples within the fitted distribution. This relies on a widely accepted assumption that anomalous instances exhibit sparser densities than normal ones, with no reliance on a clean training dataset. However, it is intractable to directly estimate the density due to complex dependencies among entities and their diverse inherent characteristics. To mitigate this, we utilize the graph structure learning model to learn interdependent and evolving relations among entities, which effectively captures complex and accurate distribution patterns of MTS. In addition, our approach incorporates the unique characteristics of individual entities by employing an entity-aware normalizing flow. This enables us to represent each entity as a parameterized normal distribution. Furthermore, considering that some entities present similar characteristics, we propose a cluster strategy that capitalizes on the commonalities of entities with similar characteristics, resulting in more precise and detailed density estimation. We refer to this cluster-aware extension as MTGFlow_cluster. Extensive experiments are conducted on six widely used benchmark datasets, in which MTGFlow and MTGFlow_cluster demonstrate their superior detection performance.
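The density-based scoring principle can be illustrated without a normalizing flow. The sketch below swaps in a single multivariate Gaussian as the density model (a deliberate simplification, not MTGFlow), fits it on unlabeled and slightly contaminated data, and scores test points by negative log-density. All names and data are made up for the example.

```python
import numpy as np

def fit_gaussian(train):
    """Fit a Gaussian density to unlabeled training data (may contain anomalies)."""
    mean = train.mean(axis=0)
    cov = np.cov(train, rowvar=False) + 1e-6 * np.eye(train.shape[1])
    return mean, cov

def anomaly_score(x, mean, cov):
    """Negative log-density under the fitted Gaussian; higher = more anomalous."""
    d = x - mean
    inv = np.linalg.inv(cov)
    maha = np.einsum("...i,ij,...j->...", d, inv, d)  # squared Mahalanobis distance
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (maha + logdet + mean.size * np.log(2 * np.pi))

# Unlabeled training set: mostly normal points plus a few unlabeled anomalies.
rng = np.random.default_rng(0)
normal = rng.normal(size=(500, 2))
contamination = np.full((5, 2), 8.0)
train = np.vstack([normal, contamination])

mean, cov = fit_gaussian(train)
score_normal = anomaly_score(np.array([0.0, 0.0]), mean, cov)
score_outlier = anomaly_score(np.array([8.0, 8.0]), mean, cov)
```

Even with contaminated training data, the point in the sparse region receives the higher score, which is the assumption the abstract relies on.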

LG | Apr 2, 2024
MESEN: Exploit Multimodal Data to Design Unimodal Human Activity Recognition with Few Labels

Lilin Xu, Chaojie Gu, Rui Tan et al.

Human activity recognition (HAR) will be an essential function of various emerging applications. However, HAR typically encounters challenges related to modality limitations and label scarcity, leading to an application gap between current solutions and real-world requirements. In this work, we propose MESEN, a multimodal-empowered unimodal sensing framework, to utilize unlabeled multimodal data available during the HAR model design phase for unimodal HAR enhancement during the deployment phase. From a study on the impact of supervised multimodal fusion on unimodal feature extraction, MESEN is designed to feature a multi-task mechanism during the multimodal-aided pre-training stage. With the proposed mechanism integrating cross-modal feature contrastive learning and multimodal pseudo-classification aligning, MESEN exploits unlabeled multimodal data to extract effective unimodal features for each modality. Subsequently, MESEN can adapt to downstream unimodal HAR with only a few labeled samples. Extensive experiments on eight public multimodal datasets demonstrate that MESEN achieves significant performance improvements over state-of-the-art baselines in enhancing unimodal HAR by exploiting multimodal data.

CV | Oct 24, 2024
Radar and Camera Fusion for Object Detection and Tracking: A Comprehensive Survey

Kun Shi, Shibo He, Zhenyu Shi et al.

Multi-modal fusion is imperative to the implementation of reliable object detection and tracking in complex environments. Exploiting the synergy of heterogeneous modal information endows perception systems with the ability to achieve more comprehensive, robust, and accurate performance. As a core concern in wireless-vision collaboration, radar-camera fusion has prompted prospective research directions owing to its extensive applicability, complementarity, and compatibility. Nonetheless, a systematic survey specifically focusing on deep fusion of radar and camera for object detection and tracking is still lacking. To fill this void, we comprehensively review radar-camera fusion in a holistic way. First, we elaborate on the fundamental principles, methodologies, and applications of radar-camera fusion perception. Next, we delve into the key techniques concerning sensor calibration, modal representation, data alignment, and fusion operation. Furthermore, we provide a detailed taxonomy covering the research topics related to object detection and tracking in the context of radar and camera technologies. Finally, we discuss the emerging perspectives in the field of radar-camera fusion perception and highlight the potential areas for future research.

CV | Apr 5
OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models

Liyu Zhang, Kehan Li, Tingrui Han et al.

Post-training via GRPO has demonstrated remarkable effectiveness in improving the generation quality of flow-matching models. However, GRPO suffers from inherently low sample efficiency due to its on-policy training paradigm. To address this limitation, we present OP-GRPO, the first Off-Policy GRPO framework tailored for flow-matching models. First, we actively select high-quality trajectories and adaptively incorporate them into a replay buffer for reuse in subsequent training iterations. Second, to mitigate the distribution shift introduced by off-policy samples, we propose a sequence-level importance sampling correction that preserves the integrity of GRPO's clipping mechanism while ensuring stable policy updates. Third, we theoretically and empirically show that late denoising steps yield ill-conditioned off-policy ratios, and mitigate this by truncating trajectories at late steps. Across image and video generation benchmarks, OP-GRPO achieves comparable or superior performance to Flow-GRPO with only 34.2% of the training steps on average, yielding substantial gains in training efficiency while maintaining generation quality.

CVOct 24, 2025
TokenCLIP: Token-wise Prompt Learning for Zero-shot Anomaly Detection

Qihang Zhou, Binbin Gao, Guansong Pang et al.

Adapting CLIP for anomaly detection on unseen objects has shown strong potential in a zero-shot manner. However, existing methods typically rely on a single textual space to align with visual semantics across diverse objects and domains. The indiscriminate alignment hinders the model from accurately capturing varied anomaly semantics. We propose TokenCLIP, a token-wise adaptation framework that enables dynamic alignment between visual and learnable textual spaces for fine-grained anomaly learning. Rather than mapping all visual tokens to a single, token-agnostic textual space, TokenCLIP aligns each token with a customized textual subspace that represents its visual characteristics. Explicitly assigning a unique learnable textual space to each token is computationally intractable and prone to insufficient optimization. We instead expand the token-agnostic textual space into a set of orthogonal subspaces, and then dynamically assign each token to a subspace combination guided by semantic affinity, which jointly supports customized and efficient token-wise adaptation. To this end, we formulate dynamic alignment as an optimal transport problem, where all visual tokens in an image are transported to textual subspaces based on semantic similarity. The transport constraints of OT ensure sufficient optimization across subspaces and encourage them to focus on different semantics. Solving the problem yields a transport plan that adaptively assigns each token to semantically relevant subspaces. A top-k masking is then applied to sparsify the plan and specialize subspaces for distinct visual regions. Extensive experiments demonstrate the superiority of TokenCLIP.
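
The assignment step described above (optimal transport from visual tokens to textual subspaces, followed by top-k masking) can be illustrated with a minimal sketch. This is not the TokenCLIP implementation: the entropic Sinkhorn solver, the uniform marginals, the regularization value, the per-row renormalization after masking, and the function names `sinkhorn_plan` and `top_k_mask` are all assumptions made for illustration.

```python
import math

def sinkhorn_plan(similarity, reg=0.1, n_iters=200):
    """Entropic OT plan between n tokens (rows) and k subspaces (columns).

    Assumption: uniform marginals (each token carries mass 1/n, each
    subspace receives mass 1/k); higher similarity means lower cost.
    """
    n, k = len(similarity), len(similarity[0])
    kernel = [[math.exp(similarity[i][j] / reg) for j in range(k)] for i in range(n)]
    a = [1.0 / n] * n  # token marginal
    b = [1.0 / k] * k  # subspace marginal
    u = [1.0] * n
    v = [1.0] * k
    for _ in range(n_iters):
        # Alternate row and column scaling until both marginals are met.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(k)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(k)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(k)] for i in range(n)]

def top_k_mask(plan, k_keep):
    """Keep the k_keep largest entries per token and renormalize each row
    (the renormalization is an illustrative choice)."""
    masked = []
    for row in plan:
        keep = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k_keep]
        kept = [row[j] if j in keep else 0.0 for j in range(len(row))]
        total = sum(kept)
        masked.append([x / total for x in kept])
    return masked

# Hypothetical token-to-subspace similarities: 4 tokens, 3 subspaces.
sim = [[0.9, 0.1, 0.0],
       [0.8, 0.2, 0.1],
       [0.1, 0.9, 0.2],
       [0.0, 0.1, 0.9]]
plan = sinkhorn_plan(sim)
weights = top_k_mask(plan, k_keep=2)
```

The column constraint is what forces every subspace to receive a share of the tokens, which matches the abstract's point that the transport constraints keep all subspaces sufficiently optimized.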

LGSep 20, 2025
mmExpert: Integrating Large Language Models for Comprehensive mmWave Data Synthesis and Understanding

Yifan Yan, Shuai Yang, Xiuzhen Guo et al.

Millimeter-wave (mmWave) sensing technology holds significant value in human-centric applications, yet the high costs associated with data acquisition and annotation limit its widespread adoption in our daily lives. Concurrently, the rapid evolution of large language models (LLMs) has opened up opportunities for addressing complex human needs. This paper presents mmExpert, an innovative mmWave understanding framework consisting of a data generation flywheel that leverages LLMs to automate the generation of synthetic mmWave radar datasets for specific application scenarios, thereby training models capable of zero-shot generalization in real-world environments. Extensive experiments demonstrate that the data synthesized by mmExpert significantly enhances the performance of downstream models and facilitates the successful deployment of large models for mmWave understanding.

CVSep 3, 2025
PointAD+: Learning Hierarchical Representations for Zero-shot 3D Anomaly Detection

Qihang Zhou, Shibo He, Jiangtao Yan et al.

In this paper, we aim to transfer CLIP's robust 2D generalization capabilities to identify 3D anomalies across unseen objects of highly diverse class semantics. To this end, we propose a unified framework to comprehensively detect and segment 3D anomalies by leveraging both point- and pixel-level information. We first design PointAD, which leverages point-pixel correspondence to represent 3D anomalies through their associated rendering pixel representations. This approach is referred to as implicit 3D representation, as it focuses solely on rendering pixel anomalies but neglects the inherent spatial relationships within point clouds. Then, we propose PointAD+ to further broaden the interpretation of 3D anomalies by introducing explicit 3D representation, emphasizing spatial abnormality to uncover abnormal spatial relationships. Hence, we propose G-aggregation, which incorporates geometry information to make the aggregated point representations spatially aware. To simultaneously capture rendering and spatial abnormality, PointAD+ proposes hierarchical representation learning, incorporating implicit and explicit anomaly semantics into hierarchical text prompts: rendering prompts for the rendering layer and geometry prompts for the geometry layer. A cross-hierarchy contrastive alignment is further introduced to promote the interaction between the rendering and geometry layers, facilitating mutual anomaly learning. Finally, PointAD+ integrates anomaly semantics from both layers to capture the generalized anomaly semantics. At test time, PointAD+ can integrate RGB information in a plug-and-play manner and further improve its detection performance. Extensive experiments demonstrate the superiority of PointAD+ in zero-shot 3D anomaly detection across unseen objects with highly diverse class semantics, achieving a holistic understanding of abnormality.

LGAug 25, 2025
Choice Outweighs Effort: Facilitating Complementary Knowledge Fusion in Federated Learning via Re-calibration and Merit-discrimination

Ming Yang, Dongrun Li, Xin Wang et al.

Cross-client data heterogeneity in federated learning induces biases that impede unbiased consensus condensation and the complementary fusion of generalization- and personalization-oriented knowledge. While existing approaches mitigate heterogeneity through model decoupling and representation center loss, they often rely on static and restricted metrics to evaluate local knowledge and adopt global alignment too rigidly, leading to consensus distortion and diminished model adaptability. To address these limitations, we propose FedMate, a method that implements bilateral optimization: On the server side, we construct a dynamic global prototype, with aggregation weights calibrated by holistic integration of sample size, current parameters, and future prediction; a category-wise classifier is then fine-tuned using this prototype to preserve global consistency. On the client side, we introduce complementary classification fusion to enable merit-based discrimination training and incorporate cost-aware feature transmission to balance model performance and communication efficiency. Experiments on five datasets of varying complexity demonstrate that FedMate outperforms state-of-the-art methods in harmonizing generalization and adaptation. Additionally, semantic segmentation experiments on autonomous driving datasets validate the method's real-world scalability.

CVMar 14, 2024
Intention-aware Denoising Diffusion Model for Trajectory Prediction

Chen Liu, Shibo He, Haoyu Liu et al.

Trajectory prediction is an essential component in autonomous driving, particularly for collision avoidance systems. Considering the inherent uncertainty of the task, numerous studies have utilized generative models to produce multiple plausible future trajectories for each agent. However, most of them suffer from restricted representation ability or unstable training issues. To overcome these limitations, we propose utilizing the diffusion model to generate the distribution of future trajectories. Two cruxes are to be settled to realize such an idea. First, the diversity of intention is intertwined with the uncertain surroundings, making the true distribution hard to parameterize. Second, the diffusion process is time-consuming during the inference phase, rendering it unrealistic to implement in a real-time driving system. We propose an Intention-aware denoising Diffusion Model (IDM), which tackles the above two problems. We decouple the original uncertainty into intention uncertainty and action uncertainty and model them with two dependent diffusion processes. To decrease the inference time, we reduce the variable dimensions in the intention-aware diffusion process and restrict the initial distribution of the action-aware diffusion process, which leads to fewer diffusion steps. To validate our approach, we conduct experiments on the Stanford Drone Dataset (SDD) and ETH/UCY dataset. Our methods achieve state-of-the-art results, with an FDE of 13.83 pixels on the SDD dataset and 0.36 meters on the ETH/UCY dataset. Compared with the original diffusion model, IDM reduces inference time by two-thirds. Interestingly, our experiments further reveal that introducing intention information is beneficial in modeling the diffusion process of fewer steps.

LGJan 26, 2024
Large Language Model Guided Knowledge Distillation for Time Series Anomaly Detection

Chen Liu, Shibo He, Qihang Zhou et al.

Self-supervised methods have gained prominence in time series anomaly detection due to the scarcity of available annotations. Nevertheless, they typically demand extensive training data to acquire a generalizable representation map, which conflicts with scenarios where only a few samples are available, thereby limiting their performance. To overcome this limitation, we propose AnomalyLLM, a knowledge distillation-based time series anomaly detection approach where the student network is trained to mimic the features of the large language model (LLM)-based teacher network that is pretrained on large-scale datasets. During the testing phase, anomalies are detected when the discrepancy between the features of the teacher and student networks is large. To prevent the student network from learning the teacher network's features of anomalous samples, we devise two key strategies. 1) Prototypical signals are incorporated into the student network to consolidate the normal feature extraction. 2) We use synthetic anomalies to enlarge the representation gap between the two networks. AnomalyLLM demonstrates state-of-the-art performance on 15 datasets, improving accuracy by at least 14.5% on the UCR dataset.
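
The test-time rule described above, flagging a sample when the teacher and student features disagree, can be sketched as follows. This is only an illustration under stated assumptions: cosine distance as the discrepancy measure, a fixed threshold, and the names `cosine_distance` and `anomaly_flags` are hypothetical and not taken from the paper.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

def anomaly_flags(teacher_feats, student_feats, threshold):
    """Score each window by teacher/student feature discrepancy and
    flag windows whose score exceeds the (assumed) threshold."""
    scores = [cosine_distance(t, s) for t, s in zip(teacher_feats, student_feats)]
    flags = [score > threshold for score in scores]
    return scores, flags

# Hypothetical features for two windows: the student matches the teacher
# on the first window and diverges on the second.
teacher = [[1.0, 0.0], [0.0, 1.0]]
student = [[2.0, 0.0], [1.0, 0.0]]
scores, flags = anomaly_flags(teacher, student, threshold=0.5)
```

Because the student is trained only to reproduce the teacher on normal data, a large score indicates a window that the student never learned to mimic.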

IVFeb 8, 2022
Wireless Transmission of Images With The Assistance of Multi-level Semantic Information

Zhenguo Zhang, Qianqian Yang, Shibo He et al.

Semantic-oriented communication has been considered a promising approach to boosting bandwidth efficiency by transmitting only the semantics of the data. In this paper, we propose a multi-level semantic-aware communication system for wireless image transmission, named MLSC-image, which is based on deep learning techniques and trained in an end-to-end manner. In particular, the proposed model includes a multi-level semantic feature extractor that extracts both high-level semantic information, such as the text semantics and the segmentation semantics, and low-level semantic information, such as local spatial details of the images. We employ a pretrained image captioning model to capture the text semantics and a pretrained image segmentation model to obtain the segmentation semantics. These high-level and low-level semantic features are then combined and encoded by a joint semantic and channel encoder into symbols to transmit over the physical channel. The numerical results validate the effectiveness and efficiency of the proposed semantic communication system, especially under limited bandwidth conditions, which indicates the advantages of the high-level semantics in the compression of images.

ASFeb 7, 2022
Semantic-aware Speech to Text Transmission with Redundancy Removal

Tianxiao Han, Qianqian Yang, Zhiguo Shi et al.

Deep learning (DL) based semantic communication methods have been explored for the efficient transmission of images, text, and speech in recent years. In contrast to traditional wireless communication methods that focus on the transmission of abstract symbols, semantic communication approaches attempt to achieve better transmission efficiency by only sending the semantic-related information of the source data. In this paper, we consider semantic-oriented speech to text transmission. We propose a novel end-to-end DL-based transceiver, which includes an attention-based soft alignment module and a redundancy removal module to compress the transmitted data. In particular, the former extracts only the text-related semantic features, and the latter further drops the semantically redundant content, greatly reducing the amount of semantic redundancy compared to existing methods. We also propose a two-stage training scheme, which speeds up the training of the proposed DL model. The simulation results indicate that our proposed method outperforms current methods in terms of the accuracy of the received text and transmission efficiency. Moreover, the proposed method also has a smaller model size and shorter end-to-end runtime.

CVJan 30, 2022
Generalized Global Ranking-Aware Neural Architecture Ranker for Efficient Image Classifier Search

Bicheng Guo, Tao Chen, Shibo He et al.

Neural Architecture Search (NAS) is a powerful tool for automating the design of effective image-processing DNNs. Ranking has been advocated as a way to design an efficient performance predictor for NAS. The previous contrastive method solves the ranking problem by comparing pairs of architectures and predicting their relative performance. However, it focuses only on the ranking between the two architectures involved and neglects the overall quality distribution of the search space, which may cause generalization issues. A predictor, namely the Neural Architecture Ranker (NAR), which concentrates on the global quality tier of a specific architecture, is proposed to tackle the problems caused by this local perspective. The NAR explores the quality tiers of the search space globally and classifies each architecture into the tier it belongs to according to its global ranking. Thus, the predictor gains knowledge of the performance distribution of the search space, which helps it generalize its ranking ability to the datasets more easily. Meanwhile, the global quality distribution facilitates the search phase by directly sampling candidates according to the statistics of the quality tiers, which requires no training of a search algorithm, e.g., Reinforcement Learning (RL) or Evolutionary Algorithm (EA); this simplifies the NAS pipeline and saves computational overhead. The proposed NAR achieves better performance than the state-of-the-art methods on two widely used datasets for NAS research. On the vast search space of NAS-Bench-101, the NAR easily finds an architecture with top 0.01‰ performance by sampling alone. It also generalizes well to the different image datasets of NAS-Bench-201, i.e., CIFAR-10, CIFAR-100, and ImageNet-16-120, by identifying the optimal architecture for each of them.

LGOct 6, 2021
FTPipeHD: A Fault-Tolerant Pipeline-Parallel Distributed Training Framework for Heterogeneous Edge Devices

Yuhao Chen, Qianqian Yang, Shibo He et al.

With the increased penetration and proliferation of Internet of Things (IoT) devices, there is a growing trend towards distributing the power of deep learning (DL) across edge devices rather than centralizing it in the cloud. This development enables better privacy preservation, real-time responses, and user-specific models. To deploy deep and complex models to edge devices with limited resources, partitioning of deep neural network (DNN) models is necessary and has been widely studied. However, most of the existing literature only considers distributing the inference model while still relying on centralized cloud infrastructure to generate this model through training. In this paper, we propose FTPipeHD, a novel DNN training framework that trains DNN models across distributed heterogeneous devices with a fault-tolerance mechanism. To accelerate training under the time-varying computing power of each device, we optimize the partition points dynamically according to real-time computing capacities. We also propose a novel weight redistribution approach that periodically replicates the weights to both the neighboring nodes and the central node, which combats the failure of multiple devices during training while incurring limited communication cost. Our numerical results demonstrate that FTPipeHD is 6.8x faster in training than the state-of-the-art method when the computing capacity of the best device is 10x greater than that of the worst one. It is also shown that the proposed method is able to accelerate training even in the presence of device failures.

LGMar 17, 2021
Fairness-aware Outlier Ensemble

Haoyu Liu, Fenglong Ma, Shibo He et al.

Outlier ensemble methods have shown outstanding performance in discovering instances that are significantly different from the majority of the data. However, without awareness of fairness, their applicability in ethically sensitive scenarios, such as fraud detection and judicial judgment systems, could be degraded. In this paper, we propose to reduce the bias of outlier ensemble results through a fairness-aware ensemble framework. Due to the lack of ground truth in the outlier detection task, the key challenge is how to mitigate the degradation in detection performance as fairness improves. To address this challenge, we define a distance measure based on the output of conventional outlier ensemble techniques to estimate the possible cost associated with detection performance degradation. Meanwhile, we propose a post-processing framework to tune the original ensemble results through a stacking process so that we can achieve a trade-off between fairness and detection performance. Detection performance is measured by the area under the ROC curve (AUC), while fairness is measured at both the group and individual levels. Experiments on eight public datasets are conducted. Results demonstrate the effectiveness of the proposed framework in improving the fairness of outlier ensemble results. We also analyze the trade-off between AUC and fairness.

CRDec 30, 2020
PrivSyn: Differentially Private Data Synthesis

Zhikun Zhang, Tianhao Wang, Ninghui Li et al.

In differential privacy (DP), a challenging problem is to generate synthetic datasets that efficiently capture the useful information in the private data. The synthetic dataset enables any task to be done without privacy concerns or modifications to existing algorithms. In this paper, we present PrivSyn, the first automatic synthetic data generation method that can handle general tabular datasets (with 100 attributes and domain size $>2^{500}$). PrivSyn is composed of a new method to automatically and privately identify correlations in the data, and a novel method to generate sample data from a dense graphical model. We extensively evaluate different methods on multiple datasets to demonstrate the performance of our method.

CVSep 1, 2020
PIDNet: An Efficient Network for Dynamic Pedestrian Intrusion Detection

Jingchen Sun, Jiming Chen, Tao Chen et al.

Vision-based dynamic pedestrian intrusion detection (PID), judging whether pedestrians intrude an area-of-interest (AoI) by a moving camera, is an important task in mobile surveillance. The dynamically changing AoIs and a number of pedestrians in video frames increase the difficulty and computational complexity of determining whether pedestrians intrude the AoI, which makes previous algorithms incapable of this task. In this paper, we propose a novel and efficient multi-task deep neural network, PIDNet, to solve this problem. PIDNet is mainly designed by considering two factors: accurately segmenting the dynamically changing AoIs from a video frame captured by the moving camera and quickly detecting pedestrians from the generated AoI-contained areas. Three efficient network designs are proposed and incorporated into PIDNet to reduce the computational complexity: 1) a special PID task backbone for feature sharing, 2) a feature cropping module for feature cropping, and 3) a lighter detection branch network for feature compression. In addition, considering there are no public datasets and benchmarks in this field, we establish a benchmark dataset to evaluate the proposed network and give the corresponding evaluation metrics for the first time. Experimental results show that PIDNet can achieve 67.1% PID accuracy and 9.6 fps inference speed on the proposed dataset, which serves as a good baseline for the future vision-based dynamic PID study.

CRJan 21, 2020
You foot the bill! Attacking NFC with passive relays

Yuyi Sun, Swarun Kumar, Shibo He et al.

Imagine that when you line up in a store, the person in front of you can make you pay her bill by using a passive wearable device that forces a scan of your credit card without your awareness. An important assumption of today's Near-field Communication (NFC) enabled cards is the limited communication range between the commercial reader and the NFC cards, namely a distance below 5 cm. Previous approaches to attacking this assumption use mobile phones and active relays to enlarge the communication range in order to attack the NFC cards. However, these approaches require a power supply at the adversary side, and can be easily localized when mobile phones or active relays transmit NFC signals. We propose ReCoil, a system that uses wearable passive relays to attack NFC cards by expanding the communication range to 49.6 centimeters, a ten-fold improvement over its intended commercial distance. ReCoil is a magnetically coupled resonant wireless power transfer system, which optimizes the energy transfer by searching for the optimal geometry parameters. Specifically, we first narrow down the feasible area and design the ReCoil-Ant Colony Algorithm such that the relays absorb the maximum energy from the reader. In order to reroute the signal to pass over the surface of the human body, we then design a half waist band by carefully analyzing the impact of the distance and orientation between two coils on the mutual inductance. Then, three more coils are added to the system to further enlarge the communication range. Finally, extensive experimental results validate our analysis, showing that our passive relays composed of common copper wires and tunable capacitors expand the range of NFC attacks to 49.6 centimeters.

GTNov 2, 2017
REAP: An Efficient Incentive Mechanism for Reconciling Aggregation Accuracy and Individual Privacy in Crowdsensing

Zhikun Zhang, Shibo He, Jiming Chen et al.

Incentive mechanisms play a critical role in privacy-aware crowdsensing. Most previous studies on the co-design of incentive mechanisms and privacy preservation assume a trustworthy fusion center (FC). Very recent work has taken steps to relax the assumption of a trustworthy FC and allows participatory users (PUs) to add well-calibrated noise to their raw sensing data before reporting them, whereas the focus is on the equilibrium behavior of data subjects with binary data. Making a paradigm shift, this paper aims to quantify the privacy compensation for continuous data sensing while allowing the FC to directly control PUs. There are two conflicting objectives in such a scenario: the FC desires better quality data in order to achieve higher aggregation accuracy, whereas PUs prefer adding larger noise for higher privacy-preserving levels (PPLs). To achieve a good balance therein, we design an efficient incentive mechanism to REconcile FC's Aggregation accuracy and individual PU's data Privacy (REAP). Specifically, we adopt the celebrated notion of differential privacy to measure PUs' PPLs and quantify their impacts on the FC's aggregation accuracy. Then, appealing to Contract Theory, we design an incentive mechanism to maximize the FC's aggregation accuracy under a given budget. The proposed incentive mechanism offers different contracts to PUs with different privacy preferences, by which the FC can directly control PUs. It can further overcome the information asymmetry, i.e., the FC typically does not know each PU's precise privacy preference. We derive closed-form solutions for the optimal contracts in both complete information and incomplete information scenarios. Further, the results are generalized to the continuous case where PUs' privacy preferences take values in a continuous domain. Extensive simulations are provided to validate the feasibility and advantages of our proposed incentive mechanism.

CRNov 1, 2017
Re-DPoctor: Real-time health data releasing with w-day differential privacy

Jiajun Zhang, Xiaohui Liang, Zhikun Zhang et al.

Wearable devices enable users to collect health data and share them with healthcare providers for improved health service. Since health data contain privacy-sensitive information, an unprotected data release system may result in privacy leakage. Most existing work uses differential privacy for private data release. However, these approaches have limitations in healthcare scenarios because they do not consider the unique features of health data collected from wearables, such as continuous real-time collection and pattern preservation. In this paper, we propose Re-DPoctor, a real-time health data releasing scheme with $w$-day differential privacy where the privacy of health data collected from any consecutive $w$ days is preserved. We improve utility by using a specially designed partition algorithm to protect the health data patterns. Meanwhile, we improve privacy preservation by applying a newly proposed adaptive sampling technique and budget allocation method. We prove that Re-DPoctor satisfies $w$-day differential privacy. Experiments on real health data demonstrate that our method achieves better utility with a strong privacy guarantee than existing state-of-the-art methods.
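
To make the $w$-day guarantee concrete, the sketch below shows the simplest baseline that satisfies it: splitting a total budget $\epsilon$ uniformly so each day spends $\epsilon/w$ and releasing each value with Laplace noise. By sequential composition, any $w$ consecutive days then spend at most $\epsilon$. This is not Re-DPoctor's partition, adaptive sampling, or budget allocation method; the function names, the sensitivity of 1, and the example readings are assumptions for illustration.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def release_stream(values, epsilon, w, sensitivity=1.0, seed=0):
    """Release one noisy value per day, spending epsilon / w each day.

    Assumption: each daily value changes by at most `sensitivity`
    when one user's data for that day changes.
    """
    rng = random.Random(seed)
    eps_per_day = epsilon / w
    scale = sensitivity / eps_per_day  # Laplace mechanism noise scale
    noisy = [v + laplace_noise(scale, rng) for v in values]
    return noisy, eps_per_day

def max_window_budget(budgets, w):
    """Largest total budget spent in any window of w consecutive days."""
    n_windows = max(1, len(budgets) - w + 1)
    return max(sum(budgets[i:i + w]) for i in range(n_windows))

# Hypothetical daily readings over 10 days with a 5-day privacy window.
noisy, eps_day = release_stream([70.0] * 10, epsilon=1.0, w=5)
spent = max_window_budget([eps_day] * 10, w=5)  # never exceeds epsilon
```

Uniform allocation wastes budget on days when the data barely changes, which is the kind of inefficiency that adaptive sampling and budget allocation schemes aim to reduce.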