Pai Zheng

RO
h-index: 33
5 papers
65 citations
Novelty: 44%
AI Score: 45

5 Papers

69.7 RO May 27
World Models for Robotic Manipulation: A Survey

Fangyuan Wang, Ziyuan Wang, Guorui Pei et al.

Robotic manipulation depends on the ability to anticipate how actions reshape objects, contacts, and scene geometry before execution. Learned world models provide this capability by predicting task-relevant future evolution under robot intervention, yet the term now spans latent dynamics models, action-conditioned video generators, three- and four-dimensional scene predictors, physics-informed simulators, and predictive modules inside vision-language-action systems. This breadth has fragmented the literature and obscured the design choices that matter for manipulation. We survey world models for robotic manipulation through three questions: what future representation is predicted, how prediction is connected to action, and when prediction is used in the robot-learning pipeline. We operationally define a world model as an action-conditioned predictive system and distinguish it from perception modules, inverse models, policies, rewards, and value functions. We then organize existing work into five representation families, develop a functional taxonomy that separates integrated prediction-action models from explicit predictive planners, and characterize infrastructure roles including synthetic experience generation, candidate filtering, search-based evaluation, learned environments, and outcome verification. We further map these roles across pretraining, post-training, and inference adaptation, review 34 manipulation datasets, and synthesize evaluation protocols for predictive fidelity, task performance, and simulator reliability. This survey shows that world models are evolving from task-specific dynamics predictors into predictive infrastructure for robot learning, while exposing open challenges in contact modeling, hallucination control, action alignment, and benchmarking under closed-loop use.

CV Feb 20, 2024
Modality-Aware Integration with Large Language Models for Knowledge-based Visual Question Answering

Junnan Dong, Qinggang Zhang, Huachi Zhou et al.

Knowledge-based visual question answering (KVQA) has been extensively studied to answer visual questions with external knowledge, e.g., knowledge graphs (KGs). While several attempts have been proposed to leverage large language models (LLMs) as an implicit knowledge source, it remains challenging since LLMs may generate hallucinations. Moreover, multiple knowledge sources, e.g., images, KGs and LLMs, cannot be readily aligned for complex scenarios. To tackle these, we present a novel modality-aware integration with LLMs for KVQA (MAIL). It carefully leverages multimodal knowledge for both image understanding and knowledge reasoning. Specifically, (i) we propose a two-stage prompting strategy with LLMs to densely embody the image into a scene graph with detailed visual features; (ii) we construct a coupled concept graph by linking the mentioned entities with external facts; and (iii) we design a tailored pseudo-siamese graph medium fusion for sufficient multimodal fusion. We utilize the shared mentioned entities in the two graphs as mediums to bridge a tight inter-modal exchange, while maximally preserving insightful intra-modal learning by constraining the fusion within mediums. Extensive experiments on two benchmark datasets show the superiority of MAIL with 24x fewer resources.

LG Jul 7, 2025
Hybrid Adversarial Spectral Loss Conditional Generative Adversarial Networks for Signal Data Augmentation in Ultra-precision Machining Surface Roughness Prediction

Suiyan Shang, Chi Fai Cheung, Pai Zheng

Accurate surface roughness prediction in ultra-precision machining (UPM) is critical for real-time quality control, but small datasets hinder model performance. We propose HAS-CGAN, a Hybrid Adversarial Spectral Loss CGAN, for effective UPM data augmentation. Among five CGAN variants tested, HAS-CGAN excels in 1D force signal generation, particularly for high-frequency signals, achieving >0.85 wavelet coherence through Fourier-domain optimization. By combining generated signals with machining parameters, prediction accuracy significantly improves. Experiments with traditional ML (SVR, RF, LSTM) and deep learning models (BPNN, 1DCNN, CNN-Transformer) demonstrate that augmenting training data with 520+ synthetic samples reduces prediction error from 31.4% (original 52 samples) to ~9%, effectively addressing data scarcity in UPM roughness prediction.

AI Jun 3, 2024
Logical Reasoning with Relation Network for Inductive Knowledge Graph Completion

Qinggang Zhang, Keyu Duan, Junnan Dong et al.

Inductive knowledge graph completion (KGC) aims to infer the missing relation for a set of newly-coming entities that never appeared in the training set. Such a setting is more in line with reality, as real-world KGs are constantly evolving and introducing new knowledge. Recent studies have shown promising results using message passing over subgraphs to embed newly-coming entities for inductive KGC. However, the inductive capability of these methods is usually limited by two key issues. (i) Data sparsity: KGC always suffers from data sparsity, and the situation is further exacerbated in inductive KGC, where new entities often have few or no connections to the original KG. (ii) The cold-start problem: generating representations for new entities by gathering local information from only a few neighbors is too coarse-grained for accurate KG reasoning. To this end, we propose a novel iNfOmax RelAtion Network, namely NORAN, for inductive KG completion. It aims to mine latent relation patterns for inductive KG completion. Specifically, by centering on relations, NORAN provides a hyper view towards KG modeling, where the correlations between relations can be naturally captured as entity-independent logical evidence to conduct inductive KGC. Extensive experimental results on five benchmarks show that our framework substantially outperforms the state-of-the-art KGC methods.

RO Mar 8, 2019
Performance evaluation of a foot-controlled human-robot interface

Yanpei Huang, Etienne Burdet, Lin Cao et al.

Robotic minimally invasive interventions typically require using more than two instruments. We thus developed a foot pedal interface which allows the user to control a robotic arm (while simultaneously working with the hands) with four degrees of freedom in continuous directions and speeds. This paper evaluates and compares the performance of ten naive operators in using this new pedal interface and a traditional button interface in completing tasks. These tasks are geometrically complex path-following tasks similar to those in laparoscopic training, and the traditional button interface allows axis-by-axis control with constant speeds. Precision, time, and smoothness of the subjects' control movements for these tasks are analysed. The results demonstrate that the pedal interface can be used to control a robot for complex motion tasks. The subjects kept the average error rate at a low level of around 2.6% with both interfaces, but the pedal interface resulted in about 30% faster operation speed and 60% smoother movement, which indicates improved efficiency and user experience as compared with the button interface. The results of a questionnaire show that the operators found that controlling the robot with the pedal interface was more intuitive, comfortable, and less tiring than using the button interface.