Xin Gong

CV
h-index20
10papers
50citations
Novelty52%
AI Score51

10 Papers

SYMar 22, 2023
Resilient Output Containment Control of Heterogeneous Multiagent Systems Against Composite Attacks: A Digital Twin Approach

Yukang Cui, Lingbo Cao, Michael V. Basin et al.

This paper studies the distributed resilient output containment control of heterogeneous multiagent systems against composite attacks, including denial-of-services (DoS) attacks, false-data injection (FDI) attacks, camouflage attacks, and actuation attacks. Inspired by digital twins, a twin layer (TL) with higher security and privacy is used to decouple the above problem into two tasks: defense protocols against DoS attacks on TL and defense protocols against actuation attacks on cyber-physical layer (CPL). First, considering modeling errors of leader dynamics, we introduce distributed observers to reconstruct the leader dynamics for each follower on TL under DoS attacks. Second, distributed estimators are used to estimate follower states according to the reconstructed leader dynamics on the TL. Third, according to the reconstructed leader dynamics, we design decentralized solvers that calculate the output regulator equations on CPL. Fourth, decentralized adaptive attack-resilient control schemes that resist unbounded actuation attacks are provided on CPL. Furthermore, we apply the above control protocols to prove that the followers can achieve uniformly ultimately bounded (UUB) convergence, and the upper bound of the UUB convergence is determined explicitly. Finally, two simulation examples are provided to show the effectiveness of the proposed control protocols.

SYMar 22, 2023
Data-Driven Leader-following Consensus for Nonlinear Multi-Agent Systems against Composite Attacks: A Twins Layer Approach

Xin Gong, Jintao Peng, Dong Yang et al.

This paper studies the leader-following consensuses of uncertain and nonlinear multi-agent systems against composite attacks (CAs), including Denial of Service (DoS) attacks and actuation attacks (AAs). A double-layer control framework is formulated, where a digital twin layer (TL) is added beside the traditional cyber-physical layer (CPL), inspired by the recent Digital Twin technology. Consequently, the resilient control task against CAs can be divided into two parts: One is distributed estimation against DoS attacks on the TL and the other is resilient decentralized tracking control against actuation attacks on the CPL. %The data-driven scheme is used to deal with both model non-linearity and model uncertainty, in which only the input and output data of the system are employed throughout the whole control process. First, a distributed observer based on switching estimation law against DoS is designed on TL. Second, a distributed model free adaptive control (DMFAC) protocol based on attack compensation against AAs is designed on CPL. Moreover, the uniformly ultimately bounded convergence of consensus error of the proposed double-layer DMFAC algorithm is strictly proved. Finally, the simulation verifies the effectiveness of the resilient double-layer control scheme.

88.4CVMar 18
S-VAM: Shortcut Video-Action Model by Self-Distilling Geometric and Semantic Foresight

Haodong Yan, Zhide Zhong, Jiaguan Zhu et al.

Video action models (VAMs) have emerged as a promising paradigm for robot learning, owing to their powerful visual foresight for complex manipulation tasks. However, current VAMs, typically relying on either slow multi-step video generation or noisy one-step feature extraction, cannot simultaneously guarantee real-time inference and high-fidelity foresight. To address this limitation, we propose S-VAM, a shortcut video-action model that foresees coherent geometric and semantic representations via a single forward pass. Serving as a stable blueprint, these foreseen representations significantly simplify the action prediction. To enable this efficient shortcut, we introduce a novel self-distillation strategy that condenses structured generative priors of multi-step denoising into one-step inference. Specifically, vision foundation model (VFM) representations extracted from the diffusion model's own multi-step generated videos provide teacher targets. Lightweight decouplers, as students, learn to directly map noisy one-step features to these targets. Extensive experiments in simulation and the real world demonstrate that our S-VAM outperforms state-of-the-art methods, enabling efficient and precise manipulation in complex environments. Our project page is https://haodong-yan.github.io/S-VAM/

SYMar 22, 2023
Resilient Output Consensus Control of Heterogeneous Multi-agent Systems against Byzantine Attacks: A Twin Layer Approach

Xin Gong, Yiwen Liang, Yukang Cui et al.

This paper studies the problem of cooperative control of heterogeneous multi-agent systems (MASs) against Byzantine attacks. The agent affected by Byzantine attacks sends different wrong values to all neighbors while applying wrong input signals for itself, which is aggressive and difficult to be defended. Inspired by the concept of Digital Twin, a new hierarchical protocol equipped with a virtual twin layer (TL) is proposed, which decouples the above problems into the defense scheme against Byzantine edge attacks on the TL and the defense scheme against Byzantine node attacks on the cyber-physical layer (CPL). On the TL, we propose a resilient topology reconfiguration strategy by adding a minimum number of key edges to improve network resilience. It is strictly proved that the control strategy is sufficient to achieve asymptotic consensus in finite time with the topology on the TL satisfying strongly $(2f+1)$-robustness. On the CPL, decentralized chattering-free controllers are proposed to guarantee the resilient output consensus for the heterogeneous MASs against Byzantine node attacks. Moreover, the obtained controller shows exponential convergence. The effectiveness and practicality of the theoretical results are verified by numerical examples.

99.0CVMar 23
DualCoT-VLA: Visual-Linguistic Chain of Thought via Parallel Reasoning for Vision-Language-Action Models

Zhide Zhong, Junfeng Li, Junjie He et al.

Vision-Language-Action (VLA) models map visual observations and language instructions directly to robotic actions. While effective for simple tasks, standard VLA models often struggle with complex, multi-step tasks requiring logical planning, as well as precise manipulations demanding fine-grained spatial perception. Recent efforts have incorporated Chain-of-Thought (CoT) reasoning to endow VLA models with a ``thinking before acting'' capability. However, current CoT-based VLA models face two critical limitations: 1) an inability to simultaneously capture low-level visual details and high-level logical planning due to their reliance on isolated, single-modal CoT; 2) high inference latency with compounding errors caused by step-by-step autoregressive decoding. To address these limitations, we propose DualCoT-VLA, a visual-linguistic CoT method for VLA models with a parallel reasoning mechanism. To achieve comprehensive multi-modal reasoning, our method integrates a visual CoT for low-level spatial understanding and a linguistic CoT for high-level task planning. Furthermore, to overcome the latency bottleneck, we introduce a parallel CoT mechanism that incorporates two sets of learnable query tokens, shifting autoregressive reasoning to single-step forward reasoning. Extensive experiments demonstrate that our DualCoT-VLA achieves state-of-the-art performance on the LIBERO and RoboCasa GR1 benchmarks, as well as in real-world platforms.

IVSep 10, 2024
Unrevealed Threats: A Comprehensive Study of the Adversarial Robustness of Underwater Image Enhancement Models

Siyu Zhai, Zhibo He, Xiaofeng Cong et al.

Learning-based methods for underwater image enhancement (UWIE) have undergone extensive exploration. However, learning-based models are usually vulnerable to adversarial examples so as the UWIE models. To the best of our knowledge, there is no comprehensive study on the adversarial robustness of UWIE models, which indicates that UWIE models are potentially under the threat of adversarial attacks. In this paper, we propose a general adversarial attack protocol. We make a first attempt to conduct adversarial attacks on five well-designed UWIE models on three common underwater image benchmark datasets. Considering the scattering and absorption of light in the underwater environment, there exists a strong correlation between color correction and underwater image enhancement. On the basis of that, we also design two effective UWIE-oriented adversarial attack methods Pixel Attack and Color Shift Attack targeting different color spaces. The results show that five models exhibit varying degrees of vulnerability to adversarial attacks and well-designed small perturbations on degraded images are capable of preventing UWIE models from generating enhanced results. Further, we conduct adversarial training on these models and successfully mitigated the effectiveness of adversarial attacks. In summary, we reveal the adversarial vulnerability of UWIE models and propose a new evaluation dimension of UWIE models.

CVApr 22, 2024Code
Deep Learning-Based Point Cloud Registration: A Comprehensive Survey and Taxonomy

Yu-Xin Zhang, Jie Gui, Baosheng Yu et al.

Point cloud registration involves determining a rigid transformation to align a source point cloud with a target point cloud. This alignment is fundamental in applications such as autonomous driving, robotics, and medical imaging, where precise spatial correspondence is essential. Deep learning has greatly advanced point cloud registration by providing robust and efficient methods that address the limitations of traditional approaches, including sensitivity to noise, outliers, and initialization. However, a well-constructed taxonomy for these methods is still lacking, making it difficult to systematically classify and compare the various approaches. In this paper, we present a comprehensive survey and taxonomy on deep learning-based point cloud registration (DL-PCR). We begin with a formal description of the point cloud registration problem, followed by an overview of the datasets, evaluation metrics, and loss functions commonly used in DL-PCR. Next, we categorize existing DL-PCR methods into supervised and unsupervised approaches, as they focus on significantly different key aspects. For supervised DL-PCR methods, we organize the discussion based on key aspects, including the registration procedure, optimization strategy, learning paradigm, network enhancement, and integration with traditional methods; For unsupervised DL-PCR methods, we classify them into correspondence-based and correspondence-free approaches, depending on whether they require explicit identification of point-to-point correspondences. To facilitate a more comprehensive and fair comparison, we conduct quantitative evaluations of all recent state-of-the-art approaches, using a unified training setting and consistent data partitioning strategy. Lastly, we highlight the open challenges and discuss potential directions for future study. A comprehensive collection is available at https://github.com/yxzhang15/PCR.

CVDec 1, 2025
Open-world Hand-Object Interaction Video Generation Based on Structure and Contact-aware Representation

Haodong Yan, Hang Yu, Zhide Zhong et al.

Generating realistic hand-object interactions (HOI) videos is a significant challenge due to the difficulty of modeling physical constraints (e.g., contact and occlusion between hands and manipulated objects). Current methods utilize HOI representation as an auxiliary generative objective to guide video synthesis. However, there is a dilemma between 2D and 3D representations that cannot simultaneously guarantee scalability and interaction fidelity. To address this limitation, we propose a structure and contact-aware representation that captures hand-object contact, hand-object occlusion, and holistic structure context without 3D annotations. This interaction-oriented and scalable supervision signal enables the model to learn fine-grained interaction physics and generalize to open-world scenarios. To fully exploit the proposed representation, we introduce a joint-generation paradigm with a share-and-specialization strategy that generates interaction-oriented representations and videos. Extensive experiments demonstrate that our method outperforms state-of-the-art methods on two real-world datasets in generating physics-realistic and temporally coherent HOI videos. Furthermore, our approach exhibits strong generalization to challenging open-world scenarios, highlighting the benefit of our scalable design. Our project page is https://hgzn258.github.io/SCAR/.

67.6CVMay 11
OZ-TAL: Online Zero-Shot Temporal Action Localization

Chaolei Han, Hongsong Wang, Xin Gong et al.

Online Temporal Action Localization (On-TAL) aims to detect the occurrence time and category of actions in untrimmed streaming videos immediately upon their completion. Recent advancements in this field focus on developing more sophisticated frameworks, shifting from Online Action Detection (OAD)-based aggregation paradigm to instance-level understanding. However, existing approaches are typically trained on specific domains and often exhibit limited generalization capabilities when applied to arbitrary videos, particularly in the presence of previously unseen actions. In this paper, we introduce a new task called Online Zero-shot Temporal Action Localization (OZ-TAL), which aims to detect previously unseen actions in an online fashion. Furthermore, we propose a training-free framework that leverages off-the-shelf Vision-Language Models (VLMs) while introducing additional mechanisms to enhance visual representations and mitigate their inherent biases. We establish new benchmarks and representative baselines for OZ-TAL on THUMOS14 and ActivityNet-1.3, and extensive experiments demonstrate that our method substantially outperforms existing state-of-the-art approaches under both offline and online zero-shot settings.

ROFeb 23, 2021
Resilient Path Planning of UAVs against Covert Attacks on UWB Sensors

Jiayi He, Xin Gong, Yukang Cui et al.

In this letter, a resilient path planning scheme is proposed to navigate a UAV to the planned (nominal) destination with minimum energy-consumption in the presence of a smart attacker. The UAV is equipped with two sensors, a GPS sensor, which is vulnerable to the spoofing attacker, and a well-functioning Ultra-Wideband (UWB) sensor, which is possible to be fooled. We show that a covert attacker can significantly deviate the UAV's path by simultaneously corrupting the GPS signals and forging control inputs without being detected by the UWB sensor. The prerequisite for the attack occurrence is first discussed. Based on this prerequisite, the optimal attack scheme is proposed, which maximizes the deviation between the nominal destination and the real one. Correspondingly, an energy-efficient and resilient navigation scheme based on Pontryagin's maximum principle \cite{gelfand2000calculus} is formulated, which depresses the above covert attacker effectively. To sum up, this problem can be seen as a Stackelberg game \cite{bacsar1998dynamic} between a secure path planner (defender) and a covert attacker. The effectiveness and practicality of our theoretical results are illustrated via two series of simulation examples and a UAV experiment.