Xiaoming Tao

h-index12

21papers

1,152citations

Novelty50%

AI Score57

Ranked #18,946 of 201,326 authors (top 9%)#8,361 in CV (top 14%)

21 Papers

CVAug 8, 2022

Towards Semantic Communications: Deep Learning-Based Image Semantic Coding

Danlan Huang, Feifei Gao, Xiaoming Tao et al.

Semantic communications has received growing interest since it can remarkably reduce the amount of data to be transmitted without missing critical information. Most existing works explore the semantic encoding and transmission for text and apply techniques in Natural Language Processing (NLP) to interpret the meaning of the text. In this paper, we conceive the semantic communications for image data that is much more richer in semantics and bandwidth sensitive. We propose an reinforcement learning based adaptive semantic coding (RL-ASC) approach that encodes images beyond pixel level. Firstly, we define the semantic concept of image data that includes the category, spatial arrangement, and visual feature as the representation unit, and propose a convolutional semantic encoder to extract semantic concepts. Secondly, we propose the image reconstruction criterion that evolves from the traditional pixel similarity to semantic similarity and perceptual performance. Thirdly, we design a novel RL-based semantic bit allocation model, whose reward is the increase in rate-semantic-perceptual performance after encoding a certain semantic concept with adaptive quantization level. Thus, the task-related information is preserved and reconstructed properly while less important data is discarded. Finally, we propose the Generative Adversarial Nets (GANs) based semantic decoder that fuses both locally and globally features via an attention module. Experimental results demonstrate that the proposed RL-ASC is noise robust and could reconstruct visually pleasant and semantic consistent image, and saves times of bit cost compared to standard codecs and other deep learning-based image codecs.

ROJun 4

A Novel Method with Encoder-Decoder for Cross-Sensor Adaptation in Surface Shape Sensing with Sparse Strain Sensors

Shuo Wang, Heng Luo, Dian Jin et al.

Performance variations in sensor arrays, caused by intrinsic differences or installation conditions, can lead to inconsistent results during shape sensing. To obtain accurate results, a large amount of data is usually required, and a separate model must be retrained for each sensor array, thereby increasing the cost and time of data acquisition, transmission, and computation. To address this issue, this work proposes an encoder-decoder architecture for surface shape sensing based on sparse strain sensors and further incorporates meta-learning and few-shot adaptation strategies to enable adaptation across different groups of sensor arrays. Experimental results demonstrate that, after the cross-sensor adaptation, a newly deployed sensor array achieves a sensing error of approximately 4.0 mm relying on less than 5.0% newly labeled data and requiring an adaptation time of under 1 second, which represents a substantial improvement from 23.0 mm error without adaptation and 20-minute data collection time required to train a new model. Moreover, the number of points with errors below 5.0 mm increased by more than 65.0%. These results indicate that the proposed method can substantially reduce the cost and training burden of surface shape sensing, and it has broad potential applications in soft robotics and wearable devices.

CVSep 23, 2022

Vector Quantized Semantic Communication System

Qifan Fu, Huiqiang Xie, Zhijin Qin et al.

Although analog semantic communication systems have received considerable attention in the literature, there is less work on digital semantic communication systems. In this paper, we develop a deep learning (DL)-enabled vector quantized (VQ) semantic communication system for image transmission, named VQ-DeepSC. Specifically, we propose a convolutional neural network (CNN)-based transceiver to extract multi-scale semantic features of images and introduce multi-scale semantic embedding spaces to perform semantic feature quantization, rendering the data compatible with digital communication systems. Furthermore, we employ adversarial training to improve the quality of received images by introducing a PatchGAN discriminator. Experimental results demonstrate that the proposed VQ-DeepSC is more robustness than BPG in digital communication systems and has comparable MS-SSIM performance to the DeepJSCC method.

ROMay 28

Joint Angle Estimation with Customized Wristband Based on Online Incremental Learning

Shuo Wang, Xiaobin Chen, Xiaoming Tao

Intelligent wearable technology plays an increasingly important role in human-computer interaction, motion, and health monitoring. To ensure comfort and practicality of use, one common form for motion monitoring is to utilize soft wearable sensors. However, many research applications regarding wearable sensors are simplistic and difficult to adapt to different situations. This study proposes a system for estimating the angle of the wrist joint using a customized wristband based on an online incremental learning approach. It is a two-stage estimation method: the first stage updates the model based on the wearer's wrist movement characteristics using online learning, integrating real-time data from an IMU as ground truth. The second stage utilizes the updated model for estimation of wrist joint angle solely with the wristband. In other words, model training is completed during data acquisition, allowing the trained model to be used for subsequent angle estimation. This method offers advantages in adapting to data drift caused by variations in different testing configurations, such as the left and right wrists of the same subject, deviations in the wearing position on the same wrist, and even differences among various subjects. The results indicate that the sensors exhibit good performance under strain variations, and the wrist joint trajectory estimation of the proposed system has an approximate error of 15 degree in different scenarios.

CVDec 12, 2025Code

Assisted Refinement Network Based on Channel Information Interaction for Camouflaged and Salient Object Detection

Kuan Wang, Yanjun Qin, Mengge Lu et al.

Camouflaged Object Detection (COD) stands as a significant challenge in computer vision, dedicated to identifying and segmenting objects visually highly integrated with their backgrounds. Current mainstream methods have made progress in cross-layer feature fusion, but two critical issues persist during the decoding stage. The first is insufficient cross-channel information interaction within the same-layer features, limiting feature expressiveness. The second is the inability to effectively co-model boundary and region information, making it difficult to accurately reconstruct complete regions and sharp boundaries of objects. To address the first issue, we propose the Channel Information Interaction Module (CIIM), which introduces a horizontal-vertical integration mechanism in the channel dimension. This module performs feature reorganization and interaction across channels to effectively capture complementary cross-channel information. To address the second issue, we construct a collaborative decoding architecture guided by prior knowledge. This architecture generates boundary priors and object localization maps through Boundary Extraction (BE) and Region Extraction (RE) modules, then employs hybrid attention to collaboratively calibrate decoded features, effectively overcoming semantic ambiguity and imprecise boundaries. Additionally, the Multi-scale Enhancement (MSE) module enriches contextual feature representations. Extensive experiments on four COD benchmark datasets validate the effectiveness and state-of-the-art performance of the proposed model. We further transferred our model to the Salient Object Detection (SOD) task and demonstrated its adaptability across downstream tasks, including polyp segmentation, transparent object detection, and industrial and road defect detection. Code and experimental results are publicly available at: https://github.com/akuan1234/ARNet-v2.

IVAug 8, 2024

Synchronous Multi-modal Semantic Communication System with Packet-level Coding

Yun Tian, Jingkai Ying, Zhijin Qin et al.

Although the semantic communication with joint semantic-channel coding design has shown promising performance in transmitting data of different modalities over physical layer channels, the synchronization and packet-level forward error correction of multimodal semantics have not been well studied. Due to the independent design of semantic encoders, synchronizing multimodal features in both the semantic and time domains is a challenging problem. In this paper, we take the facial video and speech transmission as an example and propose a Synchronous Multimodal Semantic Communication System (SyncSC) with Packet-Level Coding. To achieve semantic and time synchronization, 3D Morphable Mode (3DMM) coefficients and text are transmitted as semantics, and we propose a semantic codec that achieves similar quality of reconstruction and synchronization with lower bandwidth, compared to traditional methods. To protect semantic packets under the erasure channel, we propose a packet-Level Forward Error Correction (FEC) method, called PacSC, that maintains a certain visual quality performance even at high packet loss rates. Particularly, for text packets, a text packet loss concealment module, called TextPC, based on Bidirectional Encoder Representations from Transformers (BERT) is proposed, which significantly improves the performance of traditional FEC methods. The simulation results show that our proposed SyncSC reduce transmission overhead and achieve high-quality synchronous transmission of video and speech over the packet loss network.

LGMar 31, 2023

A robust deep learning-based damage identification approach for SHM considering missing data

Fan Deng, Xiaoming Tao, Pengxiang Wei et al.

Data-driven method for Structural Health Monitoring (SHM), that mine the hidden structural performance from the correlations among monitored time series data, has received widely concerns recently. However, missing data significantly impacts the conduction of this method. Missing data is a frequently encountered issue in time series data in SHM and many other real-world applications, that harms to the standardized data mining and downstream tasks, such as condition assessment. Imputation approaches based on spatiotemporal relations among monitoring data are developed to handle this issue, however, no additional information is added during imputation. This paper thus develops a robust method for damage identification that considers the missing data occasions, based on long-short term memory (LSTM) model and dropout mechanism in the autoencoder (AE) framework. Inputs channels are randomly dropped to simulate the missing data in training, and reconstruction errors are used as the loss function and the damage indicator. Quasi-static response (cable tension) of a cable-stayed bridge released in 1st IPC-SHM is employed to verify this proposed method, and results show that the missing data imputation and damage identification can be implemented together in a unified way.

CVApr 19

Frequency-guided Multi-level Reasoning for Scene Graph Generation in Video

Chenxing Li, Yiping Duan, Xiaoming Tao

Video Scene Graph Generation aims to obtain structured semantic representations of objects and their relationships in videos for high-level understanding. However, existing methods still have limitations in handling long-tail distributions. This paper proposes the Frequency-guided Relational Multi-level Reasoning (FReMuRe) model, which enhances the modeling ability of long-tail relationships from a mechanism perspective. We introduce relation-specific branches to deal gradient conflicts, yielding more balanced and tail-aware learning. And we design a frequency-aware dual-branch predicate embedding network to model high-frequency and low-frequency relationships separately and improve the recall rate of tail classes through gated fusion. Meanwhile, we propose two types of interchangeable relation classification heads: Bayesian Head for uncertainty estimation and new Gaussian Mixture Model Head to enhance intra-class diversity. Experimental results show that FReMuRe significantly improves the recall rate of long-tail relationships and overall reasoning robustness on the Action Genome dataset.

CLJun 2, 2025Code

Argument-Centric Causal Intervention Method for Mitigating Bias in Cross-Document Event Coreference Resolution

Long Yao, Wenzhong Yang, Yabo Yin et al.

Cross-document Event Coreference Resolution (CD-ECR) is a fundamental task in natural language processing (NLP) that seeks to determine whether event mentions across multiple documents refer to the same real-world occurrence. However, current CD-ECR approaches predominantly rely on trigger features within input mention pairs, which induce spurious correlations between surface-level lexical features and coreference relationships, impairing the overall performance of the models. To address this issue, we propose a novel cross-document event coreference resolution method based on Argument-Centric Causal Intervention (ACCI). Specifically, we construct a structural causal graph to uncover confounding dependencies between lexical triggers and coreference labels, and introduce backdoor-adjusted interventions to isolate the true causal effect of argument semantics. To further mitigate spurious correlations, ACCI integrates a counterfactual reasoning module that quantifies the causal influence of trigger word perturbations, and an argument-aware enhancement module to promote greater sensitivity to semantically grounded information. In contrast to prior methods that depend on costly data augmentation or heuristic-based filtering, ACCI enables effective debiasing in a unified end-to-end framework without altering the underlying training procedure. Extensive experiments demonstrate that ACCI achieves CoNLL F1 of 88.4% on ECB+ and 85.2% on GVC, achieving state-of-the-art performance. The implementation and materials are available at https://github.com/era211/ACCI.

IVDec 18, 2023

Federated Multi-View Synthesizing for Metaverse

Yiyu Guo, Zhijin Qin, Xiaoming Tao et al.

The metaverse is expected to provide immersive entertainment, education, and business applications. However, virtual reality (VR) transmission over wireless networks is data- and computation-intensive, making it critical to introduce novel solutions that meet stringent quality-of-service requirements. With recent advances in edge intelligence and deep learning, we have developed a novel multi-view synthesizing framework that can efficiently provide computation, storage, and communication resources for wireless content delivery in the metaverse. We propose a three-dimensional (3D)-aware generative model that uses collections of single-view images. These single-view images are transmitted to a group of users with overlapping fields of view, which avoids massive content transmission compared to transmitting tiles or whole 3D models. We then present a federated learning approach to guarantee an efficient learning process. The training performance can be improved by characterizing the vertical and horizontal data samples with a large latent feature space, while low-latency communication can be achieved with a reduced number of transmitted parameters during federated learning. We also propose a federated transfer learning framework to enable fast domain adaptation to different target domains. Simulation results have demonstrated the effectiveness of our proposed federated multi-view synthesizing framework for VR content delivery.

AIMar 28, 2025

Multi-Task Semantic Communications via Large Models

Wanli Ni, Zhijin Qin, Haofeng Sun et al.

Artificial intelligence (AI) promises to revolutionize the design, optimization and management of next-generation communication systems. In this article, we explore the integration of large AI models (LAMs) into semantic communications (SemCom) by leveraging their multi-modal data processing and generation capabilities. Although LAMs bring unprecedented abilities to extract semantics from raw data, this integration entails multifaceted challenges including high resource demands, model complexity, and the need for adaptability across diverse modalities and tasks. To overcome these challenges, we propose a LAM-based multi-task SemCom (MTSC) architecture, which includes an adaptive model compression strategy and a federated split fine-tuning approach to facilitate the efficient deployment of LAM-based semantic models in resource-limited networks. Furthermore, a retrieval-augmented generation scheme is implemented to synthesize the most recent local and global knowledge bases to enhance the accuracy of semantic extraction and content generation, thereby improving the inference performance. Finally, simulation results demonstrate the efficacy of the proposed LAM-based MTSC architecture, highlighting the performance enhancements across various downstream tasks under varying channel conditions.

LGFeb 4

Efficient Equivariant High-Order Crystal Tensor Prediction via Cartesian Local-Environment Many-Body Coupling

Dian Jin, Yancheng Yuan, Xiaoming Tao

End-to-end prediction of high-order crystal tensor properties from atomic structures remains challenging: while spherical-harmonic equivariant models are expressive, their Clebsch-Gordan tensor products incur substantial compute and memory costs for higher-order targets. We propose the Cartesian Environment Interaction Tensor Network (CEITNet), an approach that constructs a multi-channel Cartesian local environment tensor for each atom and performs flexible many-body mixing via a learnable channel-space interaction. By performing learning in channel space and using Cartesian tensor bases to assemble equivariant outputs, CEITNet enables efficient construction of high-order tensor. Across benchmark datasets for order-2 dielectric, order-3 piezoelectric, and order-4 elastic tensor prediction, CEITNet surpasses prior high-order prediction methods on key accuracy criteria while offering high computational efficiency.

SPNov 19, 2025

Joint Semantic-Channel Coding and Modulation for Token Communications

Jingkai Ying, Zhijin Qin, Yulong Feng et al.

In recent years, the Transformer architecture has achieved outstanding performance across a wide range of tasks and modalities. Token is the unified input and output representation in Transformer-based models, which has become a fundamental information unit. In this work, we consider the problem of token communication, studying how to transmit tokens efficiently and reliably. Point cloud, a prevailing three-dimensional format which exhibits a more complex spatial structure compared to image or video, is chosen to be the information source. We utilize the set abstraction method to obtain point tokens. Subsequently, to get a more informative and transmission-friendly representation based on tokens, we propose a joint semantic-channel and modulation (JSCCM) scheme for the token encoder, mapping point tokens to standard digital constellation points (modulated tokens). Specifically, the JSCCM consists of two parallel Point Transformer-based encoders and a differential modulator which combines the Gumel-softmax and soft quantization methods. Besides, the rate allocator and channel adapter are developed, facilitating adaptive generation of high-quality modulated tokens conditioned on both semantic information and channel conditions. Extensive simulations demonstrate that the proposed method outperforms both joint semantic-channel coding and traditional separate coding, achieving over 1dB gain in reconstruction and more than 6x compression ratio in modulated symbols.

LGNov 10, 2025

Magnitude-Modulated Equivariant Adapter for Parameter-Efficient Fine-Tuning of Equivariant Graph Neural Networks

Dian Jin, Yancheng Yuan, Xiaoming Tao

Pretrained equivariant graph neural networks based on spherical harmonics offer efficient and accurate alternatives to computationally expensive ab-initio methods, yet adapting them to new tasks and chemical environments still requires fine-tuning. Conventional parameter-efficient fine-tuning (PEFT) techniques, such as Adapters and LoRA, typically break symmetry, making them incompatible with those equivariant architectures. ELoRA, recently proposed, is the first equivariant PEFT method. It achieves improved parameter efficiency and performance on many benchmarks. However, the relatively high degrees of freedom it retains within each tensor order can still perturb pretrained feature distributions and ultimately degrade performance. To address this, we present Magnitude-Modulated Equivariant Adapter (MMEA), a novel equivariant fine-tuning method which employs lightweight scalar gating to modulate feature magnitudes on a per-order and per-multiplicity basis. We demonstrate that MMEA preserves strict equivariance and, across multiple benchmarks, consistently improves energy and force predictions to state-of-the-art levels while training fewer parameters than competing approaches. These results suggest that, in many practical scenarios, modulating channel magnitudes is sufficient to adapt equivariant models to new chemical environments without breaking symmetry, pointing toward a new paradigm for equivariant PEFT design.

CVMay 12, 2025

Few-shot Semantic Encoding and Decoding for Video Surveillance

Baoping Cheng, Yukun Zhang, Liming Wang et al.

With the continuous increase in the number and resolution of video surveillance cameras, the burden of transmitting and storing surveillance video is growing. Traditional communication methods based on Shannon's theory are facing optimization bottlenecks. Semantic communication, as an emerging communication method, is expected to break through this bottleneck and reduce the storage and transmission consumption of video. Existing semantic decoding methods often require many samples to train the neural network for each scene, which is time-consuming and labor-intensive. In this study, a semantic encoding and decoding method for surveillance video is proposed. First, the sketch was extracted as semantic information, and a sketch compression method was proposed to reduce the bit rate of semantic information. Then, an image translation network was proposed to translate the sketch into a video frame with a reference frame. Finally, a few-shot sketch decoding network was proposed to reconstruct video from sketch. Experimental results showed that the proposed method achieved significantly better video reconstruction performance than baseline methods. The sketch compression method could effectively reduce the storage and transmission consumption of semantic information with little compromise on video quality. The proposed method provides a novel semantic encoding and decoding method that only needs a few training samples for each surveillance scene, thus improving the practicality of the semantic communication system.

IVJun 15, 2024

Object-Attribute-Relation Representation Based Video Semantic Communication

Qiyuan Du, Yiping Duan, Qianqian Yang et al.

With the rapid growth of multimedia data volume, there is an increasing need for efficient video transmission in applications such as virtual reality and future video streaming services. Semantic communication is emerging as a vital technique for ensuring efficient and reliable transmission in low-bandwidth, high-noise settings. However, most current approaches focus on joint source-channel coding (JSCC) that depends on end-to-end training. These methods often lack an interpretable semantic representation and struggle with adaptability to various downstream tasks. In this paper, we introduce the use of object-attribute-relation (OAR) as a semantic framework for videos to facilitate low bit-rate coding and enhance the JSCC process for more effective video transmission. We utilize OAR sequences for both low bit-rate representation and generative video reconstruction. Additionally, we incorporate OAR into the image JSCC model to prioritize communication resources for areas more critical to downstream tasks. Our experiments on traffic surveillance video datasets assess the effectiveness of our approach in terms of video transmission performance. The empirical findings demonstrate that our OAR-based video coding method not only outperforms H.265 coding at lower bit-rates but also synergizes with JSCC to deliver robust and efficient video transmission.

ITDec 30, 2021

Semantic Communications: Principles and Challenges

Zhijin Qin, Xiaoming Tao, Jianhua Lu et al.

Semantic communication, regarded as the breakthrough beyond the Shannon paradigm, aims at the successful transmission of semantic information conveyed by the source rather than the accurate reception of each single symbol or bit regardless of its meaning. This article provides an overview on semantic communications. After a brief review of Shannon information theory, we discuss semantic communications with theory, framework, and system design enabled by deep learning. Different from the symbol/bit error rate used for measuring conventional communication systems, performance metrics for semantic communications are also discussed. The article concludes with several open questions in semantic communications.

IVMar 4, 2021

Perceptual Image Restoration with High-Quality Priori and Degradation Learning

Chaoyi Han, Yiping Duan, Xiaoming Tao et al.

Perceptual image restoration seeks for high-fidelity images that most likely degrade to given images. For better visual quality, previous work proposed to search for solutions within the natural image manifold, by exploiting the latent space of a generative model. However, the quality of generated images are only guaranteed when latent embedding lies close to the prior distribution. In this work, we propose to restrict the feasible region within the prior manifold. This is accomplished with a non-parametric metric for two distributions: the Maximum Mean Discrepancy (MMD). Moreover, we model the degradation process directly as a conditional distribution. We show that our model performs well in measuring the similarity between restored and degraded images. Instead of optimizing the long criticized pixel-wise distance over degraded images, we rely on such model to find visual pleasing images with high probability. Our simultaneous restoration and enhancement framework generalizes well to real-world complicated degradation types. The experimental results on perceptual quality and no-reference image quality assessment (NR-IQA) demonstrate the superior performance of our method.

CVApr 15, 2019

Saliency Prediction on Omnidirectional Images with Generative Adversarial Imitation Learning

Mai Xu, Li Yang, Xiaoming Tao et al.

When watching omnidirectional images (ODIs), subjects can access different viewports by moving their heads. Therefore, it is necessary to predict subjects' head fixations on ODIs. Inspired by generative adversarial imitation learning (GAIL), this paper proposes a novel approach to predict saliency of head fixations on ODIs, named SalGAIL. First, we establish a dataset for attention on ODIs (AOI). In contrast to traditional datasets, our AOI dataset is large-scale, which contains the head fixations of 30 subjects viewing 600 ODIs. Next, we mine our AOI dataset and determine three findings: (1) The consistency of head fixations are consistent among subjects, and it grows alongside the increased subject number; (2) The head fixations exist with a front center bias (FCB); and (3) The magnitude of head movement is similar across subjects. According to these findings, our SalGAIL approach applies deep reinforcement learning (DRL) to predict the head fixations of one subject, in which GAIL learns the reward of DRL, rather than the traditional human-designed reward. Then, multi-stream DRL is developed to yield the head fixations of different subjects, and the saliency map of an ODI is generated via convoluting predicted head fixations. Finally, experiments validate the effectiveness of our approach in predicting saliency maps of ODIs, significantly better than 10 state-of-the-art approaches.

CVMar 5, 2019

A DenseNet Based Approach for Multi-Frame In-Loop Filter in HEVC

Tianyi Li, Mai Xu, Ren Yang et al.

High efficiency video coding (HEVC) has brought outperforming efficiency for video compression. To reduce the compression artifacts of HEVC, we propose a DenseNet based approach as the in-loop filter of HEVC, which leverages multiple adjacent frames to enhance the quality of each encoded frame. Specifically, the higher-quality frames are found by a reference frame selector (RFS). Then, a deep neural network for multi-frame in-loop filter (named MIF-Net) is developed to enhance the quality of each encoded frame by utilizing the spatial information of this frame and the temporal information of its neighboring higher-quality frames. The MIF-Net is built on the recently developed DenseNet, benefiting from the improved generalization capacity and computational efficiency. Finally, experimental results verify the effectiveness of our multi-frame in-loop filter, outperforming the HM baseline and other state-of-the-art approaches.

MMOct 8, 2016

Saliency-Guided Complexity Control for HEVC Decoding

Ren Yang, Mai Xu, Zulin Wang et al.

The latest High Efficiency Video Coding (HEVC) standard significantly improves coding efficiency over its previous video coding standards. The expense of such improvement is enormous computational complexity, from both encoding and decoding sides. Since computational capability and power capacity are diverse across portable devices, it is necessary to reduce decoding complexity to a target with tolerable quality loss, so called complexity control. This paper proposes a Saliency-Guided Complexity Control (SGCC) approach for HEVC decoding, which reduces the decoding complexity to the target with minimal perceptual quality loss. First, we establish the SGCC formulation to minimize perceptual quality loss at the constraint on reduced decoding complexity, which is achieved via disabling Deblocking Filter (DF) and simplifying Motion Compensation (MC) of some non-salient Coding Tree Units (CTUs). One important component in this formulation is the modelled relationship between decoding complexity reduction and DF disabling/MC simplification, which determines the control accuracy of our approach. Another component is the modelled relationship between quality loss and DF disabling/MC simplification, responsible for optimizing perceptual quality. By solving the SGCC formulation for a given target complexity, we can obtain the DF and MC settings of each CTU, and then decoding complexity can be reduced to the target. Finally, the experimental results validate the effectiveness of our SGCC approach, from the aspects of control performance, complexity-distortion performance, fluctuation of quality loss and subjective quality.