Shoujie Li

RO
h-index: 10
10 papers
82 citations
Novelty: 57%
AI Score: 54

10 Papers

RO Nov 30, 2022
Visual-tactile Fusion for Transparent Object Grasping in Complex Backgrounds

Shoujie Li, Haixin Yu, Wenbo Ding et al.

The accurate detection and grasping of transparent objects are challenging but of significance to robots. Here, a visual-tactile fusion framework for transparent object grasping under complex backgrounds and varying light conditions is proposed, including grasping position detection, tactile calibration, and visual-tactile fusion based classification. First, a multi-scene synthetic grasping dataset generation method with Gaussian distribution based data annotation is proposed. In addition, a novel grasping network named TGCNN is proposed for grasping position detection, showing good results in both synthetic and real scenes. In tactile calibration, inspired by human grasping, a fully convolutional network based tactile feature extraction method and a central location based adaptive grasping strategy are designed, improving the success rate by 36.7% compared to direct grasping. Furthermore, a visual-tactile fusion method is proposed for transparent object classification, which improves the classification accuracy by 34%. The proposed framework synergizes the advantages of vision and touch, and greatly improves the grasping efficiency of transparent objects.
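As a rough illustration of what a Gaussian distribution based annotation can look like, the sketch below builds a grasp-quality label map by placing a 2D Gaussian at an annotated grasp center. This is a generic sketch, not the paper's implementation: the function name `gaussian_grasp_map`, the grid size, and the `sigma` value are all assumptions made for the example.

```python
import math

def gaussian_grasp_map(height, width, center, sigma=2.0):
    """Return a height x width grid of soft grasp labels in (0, 1].

    The label is 1.0 at the annotated grasp center and decays with
    squared pixel distance. All parameters here are hypothetical.
    """
    cy, cx = center
    return [
        [
            math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2.0 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]

# Example: a 5x5 label map with the grasp center in the middle pixel.
label = gaussian_grasp_map(5, 5, center=(2, 2), sigma=1.0)
```

A soft label of this kind gives pixels near the annotated center partial credit instead of treating them as hard negatives, which is one common motivation for Gaussian labels.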

LG Dec 14, 2022
Safety Correction from Baseline: Towards the Risk-aware Policy in Robotics via Dual-agent Reinforcement Learning

Linrui Zhang, Zichen Yan, Li Shen et al.

Learning a risk-aware policy is essential but rather challenging in unstructured robotic tasks. Safe reinforcement learning methods open up new possibilities to tackle this problem. However, the conservative policy updates make it intractable to achieve sufficient exploration and desirable performance in complex, sample-expensive environments. In this paper, we propose a dual-agent safe reinforcement learning strategy consisting of a baseline and a safe agent. Such a decoupled framework enables high flexibility, data efficiency and risk-awareness for RL-based control. Concretely, the baseline agent is responsible for maximizing rewards under standard RL settings. Thus, it is compatible with off-the-shelf training techniques of unconstrained optimization, exploration and exploitation. On the other hand, the safe agent mimics the baseline agent for policy improvement and learns to fulfill safety constraints via off-policy RL tuning. In contrast to training from scratch, safe policy correction requires significantly fewer interactions to obtain a near-optimal policy. The dual policies can be optimized synchronously via a shared replay buffer, or leveraging the pre-trained model or the non-learning-based controller as a fixed baseline agent. Experimental results show that our approach can learn feasible skills without prior knowledge as well as deriving risk-averse counterparts from pre-trained unsafe policies. The proposed method outperforms the state-of-the-art safe RL algorithms on difficult robot locomotion and manipulation tasks with respect to both safety constraint satisfaction and sample efficiency.

RO Apr 18
Time-Division Multiplexing Actuation in Tendon-Driven Arms: Lightweight Design and Fault Tolerance

Shoujie Li, Changqing Guo, Jianle Xu et al.

Robotic manipulators for aerospace applications require a delicate balance between lightweight construction and fault-tolerant operation to satisfy strict weight limitations and ensure reliability in remote, hazardous environments. This paper presents Time-Division Multiplexing Actuation (TDMA), a practical approach for tendon-driven robots that significantly reduces actuator count while preserving high torque output and intrinsic fault tolerance. The key hardware employs a vertically-stacked rotational selection structure that integrates self-rotating TDM motors for rapid configuration, electromagnetic clutches enabling sub-0.1 second engagement, a worm gear reducer for enhanced load capacity and self-locking capability, and a dual-encoder system for precise, long-term positioning. Leveraging TDMA, the proposed MuxArm achieves a self-weight of 2.17 kg, supports an actuator driving capacity of 10 kg, and maintains end-effector accuracy up to 1% of its length, even under partial servo failure. Additionally, an actuation space trajectory planning algorithm is developed, enabling fault-tolerant control and reducing tendon load by up to 50% compared to conventional methods. Comprehensive experiments demonstrate MuxArm's robust performance in diverse settings, including free-space, cluttered, and confined environments.

RO Aug 27, 2024
Depth Restoration of Hand-Held Transparent Objects for Human-to-Robot Handover

Ran Yu, Haixin Yu, Shoujie Li et al.

Transparent objects are common in daily life, while their optical properties pose challenges for RGB-D cameras to capture accurate depth information. This issue is further amplified when these objects are hand-held, as hand occlusions further complicate depth estimation. For assistant robots, however, accurately perceiving hand-held transparent objects is critical to effective human-robot interaction. This paper presents a Hand-Aware Depth Restoration (HADR) method based on creating an implicit neural representation function from a single RGB-D image. The proposed method utilizes hand posture as an important guidance to leverage semantic and geometric information of hand-object interaction. To train and evaluate the proposed method, we create a high-fidelity synthetic dataset named TransHand-14K with a real-to-sim data generation scheme. Experiments show that our method has better performance and generalization ability compared with existing methods. We further develop a real-world human-to-robot handover system based on HADR, demonstrating its potential in human-robot interaction applications.

RO Mar 28
FlexiCup: Wireless Multimodal Suction Cup with Dual-Zone Vision-Tactile Sensing

Junhao Gong, Shoujie Li, Kit-Wa Sou et al.

Conventional suction cups lack sensing capabilities for contact-aware manipulation in unstructured environments. This paper presents FlexiCup, a multimodal suction cup with wireless electronics that integrate dual-zone vision-tactile sensing. The central zone dynamically switches between vision and tactile modalities via illumination control, while the peripheral zone provides continuous spatial awareness. The modular mechanical design supports both vacuum (sustained-contact adhesion) and Bernoulli (contactless lifting) actuation while maintaining the identical dual-zone sensing architecture, demonstrating sensing-actuation decoupling where sensing and actuation principles are orthogonally separable. We validate hardware versatility through dual control paradigms. Modular perception-driven grasping achieves comparable success rates across vacuum (90.0%) and Bernoulli (86.7%) modes using identical sensing and control pipelines, validating the sensing architecture's effectiveness across fundamentally different pneumatic principles. Diffusion-based end-to-end learning achieves 73.3% and 66.7% success on contact-aware manipulation tasks, with ablation studies confirming 13% improvements from multi-head attention coordinating dual-zone observations. Hardware designs, firmware, and experimental videos are available at the companion website: https://flexicup.junhaogong.top.

RO Mar 16
Master Micro Residual Correction with Adaptive Tactile Fusion and Force-Mixed Control for Contact-Rich Manipulation

Xingting Li, Yifan Xie, Han Liu et al.

Robotic contact-rich and fine-grained manipulation remains a significant challenge due to complex interaction dynamics and the competing requirements of multi-timescale control. While current visual imitation learning methods excel at long-horizon planning, they often fail to perceive critical interaction cues like friction variations or incipient slip, and struggle to balance global task coherence with local reactive feedback. To address these challenges, we propose M2-ResiPolicy, a novel Master-Micro residual control architecture that synergizes high-level action guidance with low-level correction. The framework consists of a Master-Guidance Policy (MGP) operating at 10 Hz, which generates temporally consistent action chunks via a diffusion-based backbone and employs a tactile-intensity-driven adaptive fusion mechanism to dynamically modulate perceptual weights between vision and touch. Simultaneously, a high-frequency (60 Hz) Micro-Residual Corrector (MRC) utilizes a lightweight GRU to provide real-time action compensation based on TCP wrench feedback. This policy is further integrated with a force-mixed PBIC execution layer, effectively regulating contact forces to ensure interaction safety. Experiments across several demanding tasks, including fragile object grasping and precision insertion, demonstrate that M2-ResiPolicy significantly outperforms standard Diffusion Policy (DP) and state-of-the-art Reactive Diffusion Policy (RDP), achieving a 93% damage-free success rate in chip grasping and superior force regulation stability.
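The two-rate structure described in this abstract (a 10 Hz master policy plus a 60 Hz residual corrector) can be sketched as a simple loop in which each master action is held for six micro ticks while a residual is added on every tick. This is only an illustrative sketch under stated assumptions: `master_policy` and `micro_residual` are hypothetical stand-ins (a linear ramp and a proportional force term), not the diffusion backbone or GRU from the paper, and the force values are made up.

```python
MASTER_HZ = 10  # master policy rate stated in the abstract
MICRO_HZ = 60   # residual corrector rate stated in the abstract
TICKS_PER_MASTER_STEP = MICRO_HZ // MASTER_HZ  # 6 micro ticks per master action

def master_policy(step):
    # Hypothetical stand-in for the master policy: a slow linear ramp.
    return 0.01 * step

def micro_residual(measured_force, target_force=1.0, gain=0.001):
    # Hypothetical stand-in for the learned corrector: a proportional
    # correction that backs off when the measured force is too high.
    return gain * (target_force - measured_force)

def run(force_readings):
    """Combine the held master action with a per-tick residual."""
    commands = []
    for tick, force in enumerate(force_readings):
        base = master_policy(tick // TICKS_PER_MASTER_STEP)
        commands.append(base + micro_residual(force))
    return commands

# Six ticks at the target force, then six ticks with excess contact force.
commands = run([1.0] * 6 + [3.0] * 6)
```

In this toy run the residual is zero while the force matches the target and becomes negative once the force exceeds it, so the high-rate term trims the slower master command between master updates.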

RO Feb 5
FUTURE-VLA: Forecasting Unified Trajectories Under Real-time Execution

Jingjing Fan, Yushan Liu, Shoujie Li et al.

General vision-language models increasingly support unified spatiotemporal reasoning over long video streams, yet deploying such capabilities on robots remains constrained by the prohibitive latency of processing long-horizon histories and generating high-dimensional future predictions. To bridge this gap, we present FUTURE-VLA, a unified architecture that reformulates long-horizon control and future forecasting as a monolithic sequence-generation task. Adopting a dual-sided efficiency paradigm, FUTURE-VLA leverages a temporally adaptive compression strategy to maximize spatiotemporal information density, enabling the ingestion of extensive multi-view histories while maintaining constant inference latency. Simultaneously, it performs latent-space autoregression to align actionable dynamics with reviewable visual look-aheads in a single forward pass. These real-time predictive capabilities further enable a prediction-guided Human-In-the-Loop mechanism via interactive execution gating, allowing operators to dynamically validate behaviors based on interpretable future previews. Extensive evaluations demonstrate that FUTURE-VLA establishes new state-of-the-art performance, attaining success rates of 99.2% on LIBERO, 75.4% on RoboTwin, and 78.0% on a real-world Piper platform, all with a 16× extended spatiotemporal window while maintaining the inference latency of a single-frame baseline.

RO May 7
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

Yushan Liu, Peibo Sun, Shoujie Li et al.

World Action Models (WAMs) enhance Vision-Language-Action policies by jointly predicting scene evolution and robot actions, but existing methods usually represent the predicted world as holistic images, video tokens, or global latents. These representations are difficult for an action decoder to address when an instruction refers to a particular object, especially under scene shifts where object identity is entangled with context. We propose OA-WAM, an Object-Addressable World Action Model for robust robot manipulation. OA-WAM decomposes each frame into N+1 slot states, with one robot slot and N object slots. Each slot contains a persistent address vector and a time-varying content vector, and is fused with text, image, proprioception, and past-action tokens in a block-causal sequence. A world head predicts next-frame slot states, while a flow-matching action head decodes a 16-step continuous action chunk in the same forward pass. Addressability is enforced by routing cross-slot attention through address-only keys and resetting the address slice at every transformer layer, separating which object to act on from what that object currently is without adding extra tokens. OA-WAM matches strong VLA and WAM baselines on LIBERO (97.8%) and SimplerEnv (79.3%), reaches state-of-the-art performance on the most relevant LIBERO-Plus geometric axes, and remains competitive on the seven-axis aggregate. A causal slot-intervention test yields a swap-binding cosine of 0.87, versus at most 0.09 for holistic baselines. These results suggest that addressable object states provide an effective interface for robust world-action modeling under scene perturbations.

CV May 28, 2025
Universal Visuo-Tactile Video Understanding for Embodied Interaction

Yifan Xie, Mingyang Li, Shoujie Li et al.

Tactile perception is essential for embodied agents to understand physical attributes of objects that cannot be determined through visual inspection alone. While existing approaches have made progress in visual and language modalities for physical understanding, they fail to effectively incorporate tactile information that provides crucial haptic feedback for real-world interaction. In this paper, we present VTV-LLM, the first multi-modal large language model for universal Visuo-Tactile Video (VTV) understanding that bridges the gap between tactile perception and natural language. To address the challenges of cross-sensor and cross-modal integration, we contribute VTV150K, a comprehensive dataset comprising 150,000 video frames from 100 diverse objects captured across three different tactile sensors (GelSight Mini, DIGIT, and Tac3D), annotated with four fundamental tactile attributes (hardness, protrusion, elasticity, and friction). We develop a novel three-stage training paradigm that includes VTV enhancement for robust visuo-tactile representation, VTV-text alignment for cross-modal correspondence, and text prompt finetuning for natural language generation. Our framework enables sophisticated tactile reasoning capabilities such as feature assessment, comparative analysis, and scenario-based decision making. Experimental evaluations demonstrate that VTV-LLM achieves superior performance in tactile video understanding tasks, establishing a foundation for more intuitive human-machine interaction in tactile domains.

RO Jan 24, 2024
Growing from Exploration: A self-exploring framework for robots based on foundation models

Shoujie Li, Ran Yu, Tong Wu et al.

Intelligent robots are the ultimate goal of the robotics field. Existing works leverage learning-based or optimization-based methods to accomplish human-defined tasks. However, the challenge of enabling robots to explore various environments autonomously remains unresolved. In this work, we propose a framework named GExp, which enables robots to explore and learn autonomously without human intervention. To achieve this goal, we devise modules including self-exploration, knowledge-base-building, and closed-loop feedback based on foundation models. Inspired by the way that infants interact with the world, GExp encourages robots to understand and explore the environment with a series of self-generated tasks. During the process of exploration, the robot will acquire skills from beneficial experiences that are useful in the future. GExp provides robots with the ability to solve complex tasks through self-exploration. GExp works independently of prior interactive knowledge and human intervention, allowing it to adapt directly to different scenarios, unlike previous studies that provided in-context examples as few-shot learning. In addition, we propose a workflow for deploying the real-world robot system with self-learned skills as an embodied assistant.