Jian Xiao

CV
h-index73
12papers
31citations
Novelty50%
AI Score53

12 Papers

CVSep 26, 2023
Divide and Conquer in Video Anomaly Detection: A Comprehensive Review and New Approach

Jian Xiao, Tianyuan Liu, Genlin Ji

Video anomaly detection is a complex task, and the principle of "divide and conquer" is often regarded as an effective approach to tackling intricate issues. It's noteworthy that recent methods in video anomaly detection have revealed the application of the divide and conquer philosophy (albeit with distinct perspectives from traditional usage), yielding impressive outcomes. This paper systematically reviews these literatures from six dimensions, aiming to enhance the use of the divide and conquer strategy in video anomaly detection. Furthermore, based on the insights gained from this review, a novel approach is presented, which integrates human skeletal frameworks with video data analysis techniques. This method achieves state-of-the-art performance on the ShanghaiTech dataset, surpassing all existing advanced methods.

6.2ITApr 20
Channel Estimation for Rydberg Atomic Quantum Receivers: Unrolled Phase Retrieval from Holographic Snapshots

Jian Xiao, Ji Wang, Ming Zeng et al.

A model-driven deep learning framework is proposed for channel estimation in Rydberg atomic quantum receivers (RAQRs) based on the measurement of holographic snapshots. Specifically, we develop a Transformer-based unrolling architecture, termed URformer, to solve the non-linear biased phase retrieval problem, which is derived by unrolling a stabilized variant of the expectation-maximization Gerchberg-Saxton (EM-GS) algorithm. Each layer of the proposed URformer incorporates three trainable modules: 1) a learnable filter network that replaces the fixed Bessel kernel in the classic EM-GS algorithm; 2) a trainable gating mechanism that adaptively combines classic updates to ensure training stability; and 3) an efficient channel Transformer module that learns to correct residual errors by capturing non-local channel dependencies. Numerical results demonstrate that the proposed URformer significantly outperforms classic iterative algorithms and conventional black-box neural networks with less pilot overhead.

64.2ITMar 25
Wireless AI Evolution: From Statistical Learners to Electromagnetic-Guided Foundation Models

Jian Xiao, Ji Wang, Kunrui Cao et al.

While initial applications of artificial intelligence (AI) in wireless communications over the past decade have demonstrated considerable potential using specialized models for targeted communication tasks, the revolutionary demands of sixth-generation (6G) networks for holographic communications, ubiquitous sensing, and native intelligence are propelling a necessary evolution towards AI-native wireless networks. The arrival of large AI models paves the way for the next phase of Wireless AI, driven by wireless foundation models (WFMs). In particular, pre-training on universal electromagnetic (EM) principles equips WFMs with the essential adaptability for a multitude of demanding 6G applications. However, existing large AI models face critical limitations, including pre-training strategies disconnected from EM-compliant constraints leading to physically inconsistent predictions, a lack of embedded understanding of wave propagation physics, and the inaccessibility of massive labeled datasets for comprehensive EM-aware training. To address these challenges, this article presents an electromagnetic information theory-guided self-supervised pre-training (EIT-SPT) framework designed to systematically inject EM physics into WFMs. The EIT-SPT framework aims to infuse WFMs with intrinsic EM knowledge, thereby enhancing their physical consistency, generalization capabilities across varied EM landscapes, and overall data efficiency. Building upon the proposed EIT-SPT framework, this article first elaborates on diverse potential applications in 6G scenarios of WFMs, then validates the efficacy of the proposed framework through illustrative case studies, and finally summarizes critical open research challenges and future directions for WFMs.

47.1ITMar 19
Channel Estimation for Flexible Intelligent Metasurfaces: From Model-Based Approaches to Neural Operators

Jian Xiao, Ji Wang, Qimei Cui et al.

Flexible intelligent metasurfaces (FIMs) offer a new solution for wireless communications by introducing morphological degrees of freedom, dynamically morphing their three-dimensional shape to ensure multipath signals interfere constructively. However, realizing the desired performance gains in FIM systems is critically dependent on acquiring accurate channel state information across a continuous and high-dimensional deformation space. Therefore, this paper investigates this fundamental channel estimation problem for FIM assisted millimeter-wave communication systems. First, we develop model-based frameworks that structure the problem as either function approximation using interpolation and kernel methods or as a sparse signal recovery problem that leverages the inherent angular sparsity of millimeter-wave channels. To further advance the estimation capability beyond explicit assumptions in model-based channel estimation frameworks, we propose a deep learning-based framework using a Fourier neural operator (FNO). By parameterizing a global convolution operator in the Fourier domain, we design an efficient FNO architecture to learn the continuous operator that maps FIM shapes to channel responses with mesh-independent properties. Furthermore, we exploit a hierarchical FNO (H-FNO) architecture to efficiently capture the multi-scale features across a hierarchy of spatial resolutions. Numerical results demonstrate that the proposed H-FNO significantly outperforms the model-based benchmarks in estimation accuracy and pilot efficiency. In particular, the interpretability analysis show that the proposed H-FNO learns an anisotropic spatial filter adapted to the physical geometry of FIM and is capable of accurately reconstructing the non-linear channel response across the continuous deformation space.

CVSep 27, 2023
Human Kinematics-inspired Skeleton-based Video Anomaly Detection

Jian Xiao, Tianyuan Liu, Genlin Ji

Previous approaches to detecting human anomalies in videos have typically relied on implicit modeling by directly applying the model to video or skeleton data, potentially resulting in inaccurate modeling of motion information. In this paper, we conduct an exploratory study and introduce a new idea called HKVAD (Human Kinematic-inspired Video Anomaly Detection) for video anomaly detection, which involves the explicit use of human kinematic features to detect anomalies. To validate the effectiveness and potential of this perspective, we propose a pilot method that leverages the kinematic features of the skeleton pose, with a specific focus on the walking stride, skeleton displacement at feet level, and neck level. Following this, the method employs a normalizing flow model to estimate density and detect anomalies based on the estimated density. Based on the number of kinematic features used, we have devised three straightforward variant methods and conducted experiments on two highly challenging public datasets, ShanghaiTech and UBnormal. Our method achieves good results with minimal computational resources, validating its effectiveness and potential.

93.7ITMar 25
Rydberg Atomic Quantum Receivers for Wireless Communications: Two-Color vs. Three-Color Excitation

Jian Xiao, Tierui Gong, Ji Wang et al.

An efficient three-color (3C) laser excitation-based Rydberg atomic quantum receiver (RAQR) architecture is investigated for wireless communications, utilizing a five-level (5L) electronic transition mechanism. Specifically, the conventional two-color (2C) RAQR with the four-level (4L) excitation faces three fundamental obstacles: 1) high cost and engineering challenges due to the reliance on unstable blue lasers; 2) a fundamental sensitivity limit in thermal atoms caused by residual Doppler broadening; and 3) the inability to detect low-frequency bands due to the energy-level constraint of two-photon resonance. To address these challenges, this paper analyzes a 3C5L-RAQR architecture with all-red/infrared lasers, which not only solves the engineering cost issues but also enables effective Doppler cancellation and low-frequency detection by exhibiting the three-photon resonance. Bridging atomic physics and communication theory, an end-to-end equivalent baseband signal model is derived. Furthermore, the performance of different RAQR architectures is evaluated in terms of sensitivity, achievable capacity and spectrum access range. Moreover, we provide an exact numerical solution for practical RAQRs by employing the Liouvillian superoperator formalism. Numerical results demonstrate that the exhibited 3C5L-RAQR achieves superior sensitivity compared to the conventional 2C4L-RAQR and the classical receiver based on the conductor antenna. Finally, the inherent sensitivity-capacity trade-off is revealed, showing that the 3C5L-RAQR is more suitable for deployment in power-limited communication scenarios demanding broad spectrum access.

CVJun 12, 2025Code
Occlusion-Aware 3D Hand-Object Pose Estimation with Masked AutoEncoders

Hui Yang, Wei Sun, Jian Liu et al.

Hand-object pose estimation from monocular RGB images remains a significant challenge mainly due to the severe occlusions inherent in hand-object interactions. Existing methods do not sufficiently explore global structural perception and reasoning, which limits their effectiveness in handling occluded hand-object interactions. To address this challenge, we propose an occlusion-aware hand-object pose estimation method based on masked autoencoders, termed as HOMAE. Specifically, we propose a target-focused masking strategy that imposes structured occlusion on regions of hand-object interaction, encouraging the model to learn context-aware features and reason about the occluded structures. We further integrate multi-scale features extracted from the decoder to predict a signed distance field (SDF), capturing both global context and fine-grained geometry. To enhance geometric perception, we combine the implicit SDF with an explicit point cloud derived from the SDF, leveraging the complementary strengths of both representations. This fusion enables more robust handling of occluded regions by combining the global context from the SDF with the precise local geometry provided by the point cloud. Extensive experiments on challenging DexYCB and HO3Dv2 benchmarks demonstrate that HOMAE achieves state-of-the-art performance in hand-object pose estimation. We will release our code and model.

CVMay 18, 2025Code
Rebalancing Contrastive Alignment with Bottlenecked Semantic Increments in Text-Video Retrieval

Jian Xiao, Zijie Song, Jialong Hu et al.

Recent progress in text-video retrieval has been largely driven by contrastive learning. However, existing methods often overlook the effect of the modality gap, which causes anchor representations to undergo in-place optimization (i.e., optimization tension) that limits their alignment capacity. Moreover, noisy hard negatives further distort the semantics of anchors. To address these issues, we propose GARE, a Gap-Aware Retrieval framework that introduces a learnable, pair-specific increment $Δ_{ij}$ between text $t_i$ and video $v_j$, redistributing gradients to relieve optimization tension and absorb noise. We derive $Δ_{ij}$ via a multivariate first-order Taylor expansion of the InfoNCE loss under a trust-region constraint, showing that it guides updates along locally consistent descent directions. A lightweight neural module conditioned on the semantic gap couples increments across batches for structure-aware correction. Furthermore, we regularize $Δ$ through a variational information bottleneck with relaxed compression, enhancing stability and semantic consistency. Experiments on four benchmarks demonstrate that GARE consistently improves alignment accuracy and robustness, validating the effectiveness of gap-aware tension mitigation. Code is available at https://github.com/musicman217/GARE-text-video-retrieval.

AIFeb 2
Efficient Cross-Architecture Knowledge Transfer for Large-Scale Online User Response Prediction

Yucheng Wu, Yuekui Yang, Hongzheng Li et al.

Deploying new architectures in large-scale user response prediction systems incurs high model switching costs due to expensive retraining on massive historical data and performance degradation under data retention constraints. Existing knowledge distillation methods struggle with architectural heterogeneity and the prohibitive cost of transferring large embedding tables. We propose CrossAdapt, a two-stage framework for efficient cross-architecture knowledge transfer. The offline stage enables rapid embedding transfer via dimension-adaptive projections without iterative training, combined with progressive network distillation and strategic sampling to reduce computational cost. The online stage introduces asymmetric co-distillation, where students update frequently while teachers update infrequently, together with a distribution-aware adaptation mechanism that dynamically balances historical knowledge preservation and fast adaptation to evolving data. Experiments on three public datasets show that CrossAdapt achieves 0.27-0.43% AUC improvements while reducing training time by 43-71%. Large-scale deployment on Tencent WeChat Channels (~10M daily samples) further demonstrates its effectiveness, significantly mitigating AUC degradation, LogLoss increase, and prediction bias compared to standard distillation baselines.

LGNov 29, 2024
CAdam: Confidence-Based Optimization for Online Learning

Shaowen Wang, Anan Liu, Jian Xiao et al.

Modern recommendation systems frequently employ online learning to dynamically update their models with freshly collected data. The most commonly used optimizer for updating neural networks in these contexts is the Adam optimizer, which integrates momentum ($m_t$) and adaptive learning rate ($v_t$). However, the volatile nature of online learning data, characterized by its frequent distribution shifts and presence of noise, poses significant challenges to Adam's standard optimization process: (1) Adam may use outdated momentum and the average of squared gradients, resulting in slower adaptation to distribution changes, and (2) Adam's performance is adversely affected by data noise. To mitigate these issues, we introduce CAdam, a confidence-based optimization strategy that assesses the consistency between the momentum and the gradient for each parameter dimension before deciding on updates. If momentum and gradient are in sync, CAdam proceeds with parameter updates according to Adam's original formulation; if not, it temporarily withholds updates and monitors potential shifts in data distribution in subsequent iterations. This method allows CAdam to distinguish between the true distributional shifts and mere noise, and to adapt more quickly to new data distributions. In various settings with distribution shift or noise, our experiments demonstrate that CAdam surpasses other well-known optimizers, including the original Adam. Furthermore, in large-scale A/B testing within a live recommendation system, CAdam significantly enhances model performance compared to Adam, leading to substantial increases in the system's gross merchandise volume (GMV).

SPJul 26, 2025
Deep Learning Based Joint Channel Estimation and Positioning for Sparse XL-MIMO OFDM Systems

Zhongnian Li, Chao Zheng, Jian Xiao et al.

This paper investigates joint channel estimation and positioning in near-field sparse extra-large multiple-input multiple-output (XL-MIMO) orthogonal frequency division multiplexing (OFDM) systems. To achieve cooperative gains between channel estimation and positioning, we propose a deep learning-based two-stage framework comprising positioning and channel estimation. In the positioning stage, the user's coordinates are predicted and utilized in the channel estimation stage, thereby enhancing the accuracy of channel estimation. Within this framework, we propose a U-shaped Mamba architecture for channel estimation and positioning, termed as CP-Mamba. This network integrates the strengths of the Mamba model with the structural advantages of U-shaped convolutional networks, enabling effective capture of local spatial features and long-range temporal dependencies of the channel. Numerical simulation results demonstrate that the proposed two-stage approach with CP-Mamba architecture outperforms existing baseline methods. Moreover, sparse arrays (SA) exhibit significantly superior performance in both channel estimation and positioning accuracy compared to conventional compact arrays.

CVJun 5, 2024
DA-Flow: Dual Attention Normalizing Flow for Skeleton-based Video Anomaly Detection

Ruituo Wu, Yang Chen, Jian Xiao et al.

Cooperation between temporal convolutional networks (TCN) and graph convolutional networks (GCN) as a processing module has shown promising results in skeleton-based video anomaly detection (SVAD). However, to maintain a lightweight model with low computational and storage complexity, shallow GCN and TCN blocks are constrained by small receptive fields and a lack of cross-dimension interaction capture. To tackle this limitation, we propose a lightweight module called the Dual Attention Module (DAM) for capturing cross-dimension interaction relationships in spatio-temporal skeletal data. It employs the frame attention mechanism to identify the most significant frames and the skeleton attention mechanism to capture broader relationships across fixed partitions with minimal parameters and flops. Furthermore, the proposed Dual Attention Normalizing Flow (DA-Flow) integrates the DAM as a post-processing unit after GCN within the normalizing flow framework. Simulations show that the proposed model is robust against noise and negative samples. Experimental results show that DA-Flow reaches competitive or better performance than the existing state-of-the-art (SOTA) methods in terms of the micro AUC metric with the fewest number of parameters. Moreover, we found that even without training, simply using random projection without dimensionality reduction on skeleton data enables substantial anomaly detection capabilities.