Nan Xiao

CV
h-index13
8papers
53citations
Novelty52%
AI Score47

8 Papers

99.7ROApr 22
Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics

Open-H-Embodiment Consortium, Nigel Nelson, Juo-Tung Chen et al.

Autonomous medical robots hold promise to improve patient outcomes, reduce provider workload, democratize access to care, and enable superhuman precision. However, autonomous medical robotics has been limited by a fundamental data problem: existing medical robotic datasets are small, single-embodiment, and rarely shared openly, restricting the development of foundation models that the field needs to advance. We introduce Open-H-Embodiment, the largest open dataset of medical robotic video with synchronized kinematics to date, spanning more than 49 institutions and multiple robotic platforms including the CMR Versius, Intuitive Surgical's da Vinci, da Vinci Research Kit (dVRK), Rob Surgical BiTrack, Virtual Incision's MIRA, Moon Surgical Maestro, and a variety of custom systems, spanning surgical manipulation, robotic ultrasound, and endoscopy procedures. We demonstrate the research enabled by this dataset through two foundation models. GR00T-H is the first open foundation vision-language-action model for medical robotics, which is the only evaluated model to achieve full end-to-end task completion on a structured suturing benchmark (25% of trials vs. 0% for all others) and achieves 64% average success across a 29-step ex vivo suturing sequence. We also train Cosmos-H-Surgical-Simulator, the first action-conditioned world model to enable multi-embodiment surgical simulation from a single checkpoint, spanning nine robotic platforms and supporting in silico policy evaluation and synthetic data generation for the medical domain. These results suggest that open, large-scale medical robot data collection can serve as critical infrastructure for the research community, enabling advances in robot learning, world modeling, and beyond.

51.7CVMar 26
arg-VU: Affordance Reasoning with Physics-Aware 3D Geometry for Visual Understanding in Robotic Surgery

Nan Xiao, Yunxin Fan, Farong Wang et al.

Affordance reasoning provides a principled link between perception and action, yet remains underexplored in surgical robotics, where tissues are highly deformable, compliant, and dynamically coupled with tool motion. We present arg-VU, a physics-aware affordance reasoning framework that integrates temporally consistent geometry tracking with constraint-induced mechanical modeling for surgical visual understanding. Surgical scenes are reconstructed using 3D Gaussian Splatting (3DGS) and converted into a temporally tracked surface representation. Extended Position-Based Dynamics (XPBD) embeds local deformation constraints and produces representative geometry points (RGPs) whose constraint sensitivities define anisotropic stiffness metrics capturing the local constraint-manifold geometry. Robotic tool poses in SE(3) are incorporated to compute rigidly induced displacements at RGPs, from which we derive two complementary measures: a physics-aware compliance energy that evaluates mechanical feasibility with respect to local deformation constraints, and a positional agreement score that captures motion alignment (as kinematic motion baseline). Experiments on surgical video datasets show that arg-VU yields more stable, physically consistent, and interpretable affordance predictions than kinematic baselines. These results demonstrate that physics-aware geometric representations enable reliable affordance reasoning for deformable surgical environments and support embodied robotic interaction.

CVDec 16, 2025
CIS-BA: Continuous Interaction Space Based Backdoor Attack for Object Detection in the Real-World

Shuxin Zhao, Bo Lang, Nan Xiao et al.

Object detection models deployed in real-world applications such as autonomous driving face serious threats from backdoor attacks. Despite their practical effectiveness,existing methods are inherently limited in both capability and robustness due to their dependence on single-trigger-single-object mappings and fragile pixel-level cues. We propose CIS-BA, a novel backdoor attack paradigm that redefines trigger design by shifting from static object features to continuous inter-object interaction patterns that describe how objects co-occur and interact in a scene. By modeling these patterns as a continuous interaction space, CIS-BA introduces space triggers that, for the first time, enable a multi-trigger-multi-object attack mechanism while achieving robustness through invariant geometric relations. To implement this paradigm, we design CIS-Frame, which constructs space triggers via interaction analysis, formalizes them as class-geometry constraints for sample poisoning, and embeds the backdoor during detector training. CIS-Frame supports both single-object attacks (object misclassification and disappearance) and multi-object simultaneous attacks, enabling complex and coordinated effects across diverse interaction states. Experiments on MS-COCO and real-world videos show that CIS-BA achieves over 97% attack success under complex environments and maintains over 95% effectiveness under dynamic multi-trigger conditions, while evading three state-of-the-art defenses. In summary, CIS-BA extends the landscape of backdoor attacks in interaction-intensive scenarios and provides new insights into the security of object detection systems.

IRFeb 20, 2025
External Large Foundation Model: How to Efficiently Serve Trillions of Parameters for Online Ads Recommendation

Mingfu Liang, Xi Liu, Rong Jin et al.

Ads recommendation is a prominent service of online advertising systems and has been actively studied. Recent studies indicate that scaling-up and advanced design of the recommendation model can bring significant performance improvement. However, with a larger model scale, such prior studies have a significantly increasing gap from industry as they often neglect two fundamental challenges in industrial-scale applications. First, training and inference budgets are restricted for the model to be served, exceeding which may incur latency and impair user experience. Second, large-volume data arrive in a streaming mode with data distributions dynamically shifting, as new users/ads join and existing users/ads leave the system. We propose the External Large Foundation Model (ExFM) framework to address the overlooked challenges. Specifically, we develop external distillation and a data augmentation system (DAS) to control the computational cost of training/inference while maintaining high performance. We design the teacher in a way like a foundation model (FM) that can serve multiple students as vertical models (VMs) to amortize its building cost. We propose Auxiliary Head and Student Adapter to mitigate the data distribution gap between FM and VMs caused by the streaming data issue. Comprehensive experiments on internal industrial-scale applications and public datasets demonstrate significant performance gain by ExFM.

CRFeb 20, 2024
APT-MMF: An advanced persistent threat actor attribution method based on multimodal and multilevel feature fusion

Nan Xiao, Bo Lang, Ting Wang et al.

Threat actor attribution is a crucial defense strategy for combating advanced persistent threats (APTs). Cyber threat intelligence (CTI), which involves analyzing multisource heterogeneous data from APTs, plays an important role in APT actor attribution. The current attribution methods extract features from different CTI perspectives and employ machine learning models to classify CTI reports according to their threat actors. However, these methods usually extract only one kind of feature and ignore heterogeneous information, especially the attributes and relations of indicators of compromise (IOCs), which form the core of CTI. To address these problems, we propose an APT actor attribution method based on multimodal and multilevel feature fusion (APT-MMF). First, we leverage a heterogeneous attributed graph to characterize APT reports and their IOC information. Then, we extract and fuse multimodal features, including attribute type features, natural language text features and topological relationship features, to construct comprehensive node representations. Furthermore, we design multilevel heterogeneous graph attention networks to learn the deep hidden features of APT report nodes; these networks integrate IOC type-level, metapath-based neighbor node-level, and metapath semantic-level attention. Utilizing multisource threat intelligence, we construct a heterogeneous attributed graph dataset for verification purposes. The experimental results show that our method not only outperforms the existing methods but also demonstrates its good interpretability for attribution analysis tasks.

CVNov 7, 2024
A Reinforcement Learning-Based Automatic Video Editing Method Using Pre-trained Vision-Language Model

Panwen Hu, Nan Xiao, Feifei Li et al.

In this era of videos, automatic video editing techniques attract more and more attention from industry and academia since they can reduce workloads and lower the requirements for human editors. Existing automatic editing systems are mainly scene- or event-specific, e.g., soccer game broadcasting, yet the automatic systems for general editing, e.g., movie or vlog editing which covers various scenes and events, were rarely studied before, and converting the event-driven editing method to a general scene is nontrivial. In this paper, we propose a two-stage scheme for general editing. Firstly, unlike previous works that extract scene-specific features, we leverage the pre-trained Vision-Language Model (VLM) to extract the editing-relevant representations as editing context. Moreover, to close the gap between the professional-looking videos and the automatic productions generated with simple guidelines, we propose a Reinforcement Learning (RL)-based editing framework to formulate the editing problem and train the virtual editor to make better sequential editing decisions. Finally, we evaluate the proposed method on a more general editing task with a real movie dataset. Experimental results demonstrate the effectiveness and benefits of the proposed context representation and the learning ability of our RL-based editing framework.

SEJan 29, 2020
The Rockerverse: Packages and Applications for Containerization with R

Daniel Nüst, Dirk Eddelbuettel, Dom Bennett et al.

The Rocker Project provides widely used Docker images for R across different application scenarios. This article surveys downstream projects that build upon the Rocker Project images and presents the current state of R packages for managing Docker images and controlling containers. These use cases cover diverse topics such as package development, reproducible research, collaborative work, cloud-based data processing, and production deployment of services. The variety of applications demonstrates the power of the Rocker Project specifically and containerisation in general. Across the diverse ways to use containers, we identified common themes: reproducible environments, scalability and efficiency, and portability across clouds. We conclude that the current growth and diversification of use cases is likely to continue its positive impact, but see the need for consolidating the Rockerverse ecosystem of packages, developing common practices for applications, and exploring alternative containerisation software.

OCSep 16, 2015
Mean Square Capacity of Power Constrained Fading Channels with Causal Encoders and Decoders

Liang Xu, Lihua Xie, Nan Xiao

This paper is concerned with the mean square stabilization problem of discrete-time LTI systems over a power constrained fading channel. Different from existing research works, the channel considered in this paper suffers from both fading and additive noises. We allow any form of causal channel encoders/decoders, unlike linear encoders/decoders commonly studied in the literature. Sufficient conditions and necessary conditions for the mean square stabilizability are given in terms of channel parameters such as transmission power and fading and additive noise statistics in relation to the unstable eigenvalues of the open-loop system matrix. The corresponding mean square capacity of the power constrained fading channel under causal encoders/decoders is given. It is proved that this mean square capacity is smaller than the corresponding Shannon channel capacity. In the end, numerical examples are presented, which demonstrate that the causal encoders/decoders render less restrictive stabilizability conditions than those under linear encoders/decoders studied in the existing works.