Yu She

RO
h-index: 5
20 papers
673 citations
Novelty: 50%
AI Score: 57

20 Papers

RO · Dec 9, 2022
Visuotactile Affordances for Cloth Manipulation with Local Control

Neha Sunil, Shaoxiong Wang, Yu She et al. · stanford

Cloth in the real world is often crumpled, self-occluded, or folded in on itself such that key regions, such as corners, are not directly graspable, making manipulation difficult. We propose a system that leverages visual and tactile perception to unfold the cloth via grasping and sliding on edges. By doing so, the robot is able to grasp two adjacent corners, enabling subsequent manipulation tasks like folding or hanging. As components of this system, we develop tactile perception networks that classify whether an edge is grasped and estimate the pose of the edge. We use the edge classification network to supervise a visuotactile edge grasp affordance network that can grasp edges with a 90% success rate. Once an edge is grasped, we demonstrate that the robot can slide along the cloth to the adjacent corner using tactile pose estimation/control in real time. See http://nehasunil.com/visuotactile/visuotactile.html for videos.

61.3 · RO · May 31
PLanAR: Planning-Language-Grounded Agentic Reasoning for Robot Manipulation

Pengyuan Guo, Zhonghao Mai, Zhengtong Xu et al.

Recent advances in vision-language models (VLMs) have enabled increasing progress in real-world robot manipulation. However, long-horizon manipulation in unstructured environments requires VLMs to reason about changing scene states, action constraints, and execution outcomes, which remains difficult with natural language reasoning alone. We present PLanAR, a planning-language-grounded robot agent framework for open-vocabulary, long-horizon manipulation. PLanAR uses a planning-language interface to define the VLM reasoning space: object predicates represent scene states, action schemas specify robot skills with preconditions and effects, and symbolic plans provide executable intermediate representations. This interface enables stepwise verification: after each action, PLanAR uses onboard observations to check whether the expected symbolic effects have been achieved, allowing the VLM-based agent to update task states, detect failures, and replan when execution deviates from expectation. Across robot embodiments, VLM backends, and tasks including stacking, crossword solving, and long-horizon kitchen workflows, PLanAR demonstrates strong real-world capability while revealing key limitations of current VLMs in embodied reasoning.
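The action-schema and stepwise-verification loop described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not PLanAR's implementation: `ActionSchema`, `expected_state`, `run_plan`, and the predicate strings are hypothetical names, and the `observe` callback is a stub standing in for the perception step that the abstract says relies on onboard observations.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Optional, Set, Tuple


@dataclass(frozen=True)
class ActionSchema:
    """Hypothetical STRIPS-style schema: a robot skill with preconditions and effects."""
    name: str
    preconditions: FrozenSet[str]
    add_effects: FrozenSet[str]
    del_effects: FrozenSet[str] = frozenset()


def expected_state(state: Set[str], action: ActionSchema) -> Set[str]:
    """Apply the symbolic effects of an action to the set of true predicates."""
    return (set(state) - action.del_effects) | action.add_effects


def run_plan(
    state: Set[str],
    plan: List[ActionSchema],
    observe: Callable[[Set[str], ActionSchema], Set[str]],
) -> Tuple[Set[str], Optional[str]]:
    """Execute a symbolic plan, verifying the expected effects after every step.

    `observe` (an assumption) returns the predicates that hold after the robot
    attempts the action. Returns the last observed state and the name of the
    first step that failed verification, or None if every step succeeded.
    """
    for action in plan:
        if not action.preconditions <= state:
            return state, action.name  # precondition not met: caller should replan
        state = observe(state, action)
        effects_hold = action.add_effects <= state and not (action.del_effects & state)
        if not effects_hold:
            return state, action.name  # expected effects missing: caller should replan
    return state, None
```

Passing `expected_state` as `observe` simulates perfect execution; a callback that returns the state unchanged simulates a failed grasp and makes `run_plan` report the failing step.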

RO · Mar 7, 2024 · Code
LeTac-MPC: Learning Model Predictive Control for Tactile-reactive Grasping

Zhengtong Xu, Yu She

Grasping is a crucial task in robotics, necessitating tactile feedback and reactive grasping adjustments for robust grasping of objects under various conditions and with differing physical properties. In this paper, we introduce LeTac-MPC, a learning-based model predictive control (MPC) for tactile-reactive grasping. Our approach enables the gripper to grasp objects with different physical properties on dynamic and force-interactive tasks. We utilize a vision-based tactile sensor, GelSight, which is capable of perceiving high-resolution tactile feedback that contains information on the physical properties and states of the grasped object. LeTac-MPC incorporates a differentiable MPC layer designed to model the embeddings extracted by a neural network (NN) from tactile feedback. This design facilitates convergent and robust grasping control at a frequency of 25 Hz. We propose a fully automated data collection pipeline and collect a dataset only using standardized blocks with different physical properties. However, our trained controller can generalize to daily objects with different sizes, shapes, materials, and textures. The experimental results demonstrate the effectiveness and robustness of the proposed approach. We compare LeTac-MPC with two purely model-based tactile-reactive controllers (MPC and PD) and open-loop grasping. Our results show that LeTac-MPC has optimal performance in dynamic and force-interactive tasks and optimal generalizability. We release our code and dataset at https://github.com/ZhengtongXu/LeTac-MPC.

82.3 · RO · Mar 17
CompliantVLA-adaptor: VLM-Guided Variable Impedance Action for Safe Contact-Rich Manipulation

Heng Zhang, Wei-Hsing Huang, Qiyi Tong et al.

We propose a CompliantVLA-adaptor that augments the state-of-the-art Vision-Language-Action (VLA) models with vision-language model (VLM)-informed context-aware variable impedance control (VIC) to improve the safety and effectiveness of contact-rich robotic manipulation tasks. Existing VLA systems (e.g., RDT, Pi0.5, OpenVLA-oft) typically output position, but lack force-aware adaptation, leading to unsafe or failed interactions in physical tasks involving contact, compliance, or uncertainty. In the proposed CompliantVLA-adaptor, a VLM interprets task context from images and natural language to adapt the stiffness and damping parameters of a VIC controller. These parameters are further regulated using real-time force/torque feedback to ensure interaction forces remain within safe thresholds. We demonstrate that our method outperforms the VLA baselines on a suite of complex contact-rich tasks, both in simulation and the real world, with improved success rates and reduced force violations. This work presents a promising path towards a safe foundation model for physical contact-rich manipulation. We release our code, prompts, and force-torque-impedance-scenario context datasets at https://sites.google.com/view/compliantvla.
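As a rough illustration of variable impedance control with force-based regulation, the sketch below uses the textbook one-dimensional impedance law F = K(x_d - x) + D(v_d - v) and a simple gain-scaling rule. The scaling rule, the function names, and the `k_min` floor are assumptions made for illustration; the abstract does not specify how the adaptor regulates its parameters.

```python
import math


def impedance_force(k, d, x_des, x, v_des, v):
    """Textbook 1-D impedance law: spring-damper pull toward the desired motion."""
    return k * (x_des - x) + d * (v_des - v)


def regulate_gains(k, d, measured_force, force_limit, k_min=50.0):
    """Soften the controller when the measured contact force exceeds a limit.

    Assumed rule (not from the paper): scale stiffness by force_limit / |force|,
    never going below k_min, and scale damping with sqrt(stiffness) so the
    damping ratio stays constant for a fixed effective mass.
    """
    magnitude = abs(measured_force)
    if magnitude <= force_limit:
        return k, d
    k_new = max(k_min, k * force_limit / magnitude)
    d_new = d * math.sqrt(k_new / k)
    return k_new, d_new
```

In this sketch a VLM would supply the nominal `k` and `d` for the task context, and `regulate_gains` would run at every control step on the force/torque reading.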

RO · Jan 30, 2024 · Code
LeTO: Learning Constrained Visuomotor Policy with Differentiable Trajectory Optimization

Zhengtong Xu, Yu She

This paper introduces LeTO, a method for learning constrained visuomotor policy with differentiable trajectory optimization. Our approach integrates a differentiable optimization layer into the neural network. By formulating the optimization layer as a trajectory optimization problem, we enable the model to generate actions end-to-end in a safe and constraint-controlled fashion without extra modules. Our method allows for the introduction of constraint information during the training process, thereby balancing the training objectives of satisfying constraints, smoothing the trajectories, and minimizing errors with demonstrations. This "gray box" method marries optimization-based safety and interpretability with powerful representational abilities of neural networks. We quantitatively evaluate LeTO in simulation and on a real robot. The results demonstrate that LeTO performs well in both simulated and real-world tasks. In addition, it is capable of generating trajectories that are less uncertain, higher quality, and smoother compared to existing imitation learning methods. Therefore, it is shown that LeTO provides a practical example of how to achieve the integration of neural networks with trajectory optimization. We release our code at https://github.com/ZhengtongXu/LeTO.

RO · Feb 3, 2025 · Code
VILP: Imitation Learning with Latent Video Planning

Zhengtong Xu, Qiang Qiu, Yu She

In the era of generative AI, integrating video generation models into robotics opens new possibilities for the general-purpose robot agent. This paper introduces imitation learning with latent video planning (VILP). We propose a latent video diffusion model to generate predictive robot videos that adhere to temporal consistency to a good degree. Our method is able to generate highly time-aligned videos from multiple views, which is crucial for robot policy learning. Our video generation model is highly time-efficient. For example, it can generate videos from two distinct perspectives, each consisting of six frames with a resolution of 96x160 pixels, at a rate of 5 Hz. In the experiments, we demonstrate that VILP outperforms the existing video generation robot policy across several metrics: training costs, inference speed, temporal consistency of generated videos, and the performance of the policy. We also compared our method with other imitation learning methods. Our findings indicate that VILP can rely less on extensive high-quality task-specific robot action data while still maintaining robust performance. In addition, VILP possesses robust capabilities in representing multi-modal action distributions. Our paper provides a practical example of how to effectively integrate video generation models into robot policies, potentially offering insights for related fields and directions. For more details, please refer to our open-source repository https://github.com/ZhengtongXu/VILP.

64.5 · RO · Mar 24
EquiBim: Learning Symmetry-Equivariant Policy for Bimanual Manipulation

Zhiyuan Zhang, Aditya Mohan, Seungho Han et al.

Robotic imitation learning has achieved impressive success in learning complex manipulation behaviors from demonstrations. However, many existing robot learning methods do not explicitly account for the physical symmetries of robotic systems, often resulting in asymmetric or inconsistent behaviors under symmetric observations. This limitation is particularly pronounced in dual-arm manipulation, where bilateral symmetry is inherent to both the robot morphology and the structure of many tasks. In this paper, we introduce EquiBim, a symmetry-equivariant policy learning framework for bimanual manipulation that enforces bilateral equivariance between observations and actions during training. Our approach formulates physical symmetry as a group action on both observation and action spaces, and imposes an equivariance constraint on policy predictions under symmetric transformations. The framework is model-agnostic and can be seamlessly integrated into a wide range of imitation learning pipelines with diverse observation modalities and action representations, including point cloud-based and image-based policies, as well as both end-effector-space and joint-space parameterizations. We evaluate EquiBim in simulation on RoboTwin, a dual-arm robotic platform with symmetric kinematics, across diverse observation and action configurations. We further validate the approach on a real-world dual-arm system. Across both simulation and physical experiments, our method consistently improves performance and robustness under distribution shifts. These results suggest that explicitly enforcing physical symmetry provides a simple yet effective inductive bias for bimanual robot learning.
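The equivariance constraint described in this abstract, that a policy applied to a mirrored observation should match the mirrored action, can be sketched as a penalty term. Everything specific below is an assumption for illustration: observations and actions are reduced to one 3-D point per arm, the group action is a left/right swap plus a reflection y -> -y, and the names `mirror` and `equivariance_loss` are hypothetical.

```python
def mirror(pair):
    """Assumed group action g: swap the arms and reflect each point across the x-z plane."""
    left, right = pair

    def flip(p):
        return (p[0], -p[1], p[2])

    return (flip(right), flip(left))


def equivariance_loss(policy, obs):
    """Squared error between policy(g . obs) and g . policy(obs).

    The loss is zero exactly when the policy commutes with the mirror
    transformation on this observation.
    """
    mirrored_action = mirror(policy(obs))
    action_of_mirrored = policy(mirror(obs))
    return sum(
        (a - b) ** 2
        for arm_a, arm_b in zip(mirrored_action, action_of_mirrored)
        for a, b in zip(arm_a, arm_b)
    )
```

During training such a term would be added to the imitation loss with some weight; a policy that moves only the left arm regardless of the scene incurs a nonzero penalty, while one that treats both arms identically incurs none.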

53.5 · RO · Apr 1
Learning When to See and When to Feel: Adaptive Vision-Torque Fusion for Contact-Aware Manipulation

Jiuzhou Lei, Chang Liu, Yu She et al.

Vision-based policies have achieved good performance in robotic manipulation due to the accessibility and richness of visual observations. However, purely visual sensing becomes insufficient in contact-rich and force-sensitive tasks where force/torque (F/T) signals provide critical information about contact dynamics, alignment, and interaction quality. Although various strategies have been proposed to integrate vision and F/T signals, including auxiliary prediction objectives, mixture-of-experts architectures, and contact-aware gating mechanisms, a comparison of these approaches remains lacking. In this work, we provide a comparison study of different F/T-vision integration strategies within diffusion-based manipulation policies. In addition, we propose an adaptive integration strategy that ignores F/T signals during non-contact phases while adaptively leveraging both vision and torque information during contact. Experimental results demonstrate that our method outperforms the strongest baseline by 14% in success rate, highlighting the importance of contact-aware multimodal fusion for robotic manipulation.

RO · May 13, 2025 · Code
ChicGrasp: Imitation-Learning based Customized Dual-Jaw Gripper Control for Delicate, Irregular Bio-products Manipulation

Amirreza Davar, Zhengtong Xu, Siavash Mahmoudi et al.

Automated poultry processing lines still rely on humans to lift slippery, easily bruised carcasses onto a shackle conveyor. Deformability, anatomical variance, and strict hygiene rules make conventional suction and scripted motions unreliable. We present ChicGrasp, an end-to-end hardware-software co-design for this task. An independently actuated dual-jaw pneumatic gripper clamps both chicken legs, while a conditional diffusion-policy controller, trained from only 50 multi-view teleoperation demonstrations (RGB + proprioception), plans 5 DoF end-effector motion, which includes jaw commands in one shot. On individually presented raw broiler carcasses, our system achieves a 40.6% grasp-and-lift success rate and completes the pick-to-shackle cycle in 38 s, whereas state-of-the-art implicit behaviour cloning (IBC) and LSTM-GMM baselines fail entirely. All CAD, code, and datasets will be open-source. ChicGrasp shows that imitation learning can bridge the gap between rigid hardware and variable bio-products, offering a reproducible benchmark and a public dataset for researchers in agricultural engineering and robot learning.

23.1 · RO · Mar 10
MuxGel: Simultaneous Dual-Modal Visuo-Tactile Sensing via Spatially Multiplexing and Deep Reconstruction

Zhixian Hu, Zhengtong Xu, Sheeraz Athar et al.

High-fidelity visuo-tactile sensing is important for precise robotic manipulation. However, most vision-based tactile sensors face a fundamental trade-off: opaque coatings enable tactile sensing but block pre-contact vision. To address this, we propose MuxGel, a spatially multiplexed sensor that captures both external visual information and contact-induced tactile signals through a single camera. By using a checkerboard coating pattern, MuxGel interleaves tactile-sensitive regions with transparent windows for external vision. This design maintains standard form factors, allowing for plug-and-play integration into GelSight-style sensors by simply replacing the gel pad. To recover full-resolution vision and tactile signals from the multiplexed inputs, we develop a U-Net-based reconstruction framework. Leveraging a sim-to-real pipeline, our model effectively decouples and restores high-fidelity tactile and visual fields simultaneously. Experiments on unseen objects demonstrate the framework's generalization and accuracy. Furthermore, we demonstrate MuxGel's utility in grasping tasks, where dual-modality feedback facilitates both pre-contact alignment and post-contact interaction. Results show that MuxGel enhances the perceptual capabilities of existing vision-based tactile sensors while maintaining compatibility with their hardware stacks. Project webpage: https://zhixianhu.github.io/muxgel/.

45.7 · LG · Mar 10
Reward-Zero: Language Embedding Driven Implicit Reward Mechanisms for Reinforcement Learning

Heng Zhang, Haddy Alchaer, Arash Ajoudani et al.

We introduce Reward-Zero, a general-purpose implicit reward mechanism that transforms natural-language task descriptions into dense, semantically grounded progress signals for reinforcement learning (RL). Reward-Zero serves as a simple yet sophisticated universal reward function that leverages language embeddings for efficient RL training. By comparing the embedding of a task specification with embeddings derived from an agent's interaction experience, Reward-Zero produces a continuous, semantically aligned sense-of-completion signal. This reward supplements sparse or delayed environmental feedback without requiring task-specific engineering. When integrated into standard RL frameworks, it accelerates exploration, stabilizes training, and enhances generalization across diverse tasks. Empirically, agents trained with Reward-Zero converge faster and achieve higher final success rates than conventional methods such as PPO with common reward-shaping baselines, and in some complex tasks they succeed where hand-designed rewards could not. In addition, we develop a mini benchmark for the evaluation of completion sense during task execution via language embeddings. These results highlight the promise of language-driven implicit reward functions as a practical path toward more sample-efficient, generalizable, and scalable RL for embodied agents. Code will be released after peer review.
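One simple way to compare a task embedding with an experience embedding is cosine similarity, sketched below. This is an assumption for illustration only: the abstract does not state which comparison Reward-Zero uses, the embeddings are taken as plain lists of floats from some unspecified language encoder, and `shaped_reward` with its `weight` parameter is a hypothetical helper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def shaped_reward(env_reward, task_embedding, experience_embedding, weight=0.1):
    """Add a dense, embedding-based progress term to a sparse environment reward."""
    progress = cosine_similarity(task_embedding, experience_embedding)
    return env_reward + weight * progress
```

The dense term changes at every step as the experience embedding moves toward or away from the task embedding, which is what lets it supplement a reward that is otherwise zero until the task is finished.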

7.6 · RO · Mar 10
Caterpillar-Inspired Spring-Based Compressive Continuum Robot for Bristle-based Exploration

Zhixian Hu, Yu She, Juan Wachs

Exploration of confined spaces, such as pipelines and ducts, remains challenging for conventional rigid robots due to limited space, irregular geometry, and restricted access. Inspired by caterpillar locomotion and sensing, this paper presents a compact spring-based tendon-driven continuum robot that integrates with commercial robotic arms for confined-space inspection. The system combines a mechanically compliant continuum body with a tendon actuation module, enabling coupled bending and axial length change, and uses a constant-curvature kinematic model for positional control. Experiments show a mean position error of 4.32 mm under the proposed model and control pipeline. To extend the system from motion to inspection, we integrate an artificial bristle contact sensor and demonstrate surface perception and confined-space exploration through contact interactions. This compact and compliant design offers a cost-effective upgrade for commercial robots and promises effective exploration in challenging environments.
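For readers unfamiliar with constant-curvature kinematics, the sketch below gives the standard single-segment forward model: the backbone is treated as a circular arc of curvature `kappa` and arc length `length`, bending in a plane rotated by `phi` about the base z-axis. The parameter names and units (metres, radians) are assumptions; the paper's exact parameterization and tendon mapping are not given in the abstract.

```python
import math


def cc_tip_position(kappa, phi, length):
    """Tip position (x, y, z) of a constant-curvature arc leaving the base along +z.

    kappa:  curvature in 1/m (0 means a straight segment)
    phi:    bending-plane angle about the base z-axis, in radians
    length: arc length in m (variable here, since the robot can compress)
    """
    if abs(kappa) < 1e-9:
        radial, z = 0.0, length  # straight-line limit of the arc formulas
    else:
        theta = kappa * length   # total bending angle of the arc
        radial = (1.0 - math.cos(theta)) / kappa
        z = math.sin(theta) / kappa
    return (radial * math.cos(phi), radial * math.sin(phi), z)
```

Because arc length is an input rather than a constant, the same function covers both bending and axial compression: shortening `length` at fixed `kappa` pulls the tip back along the same circle.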

78.1 · CV · May 8
Learning Visual Feature-Based World Models via Residual Latent Action

Xinyu Zhang, Zhengtong Xu, Yutian Tao et al.

World models predict future transitions from observations and actions. Existing works predominantly focus on image generation only. Visual feature-based world models, on the other hand, predict future visual features instead of raw video pixels, offering a promising alternative that is more efficient and less prone to hallucination. However, current feature-based approaches rely on direct regression, which leads to blurry or collapsed predictions in complex interactions, while generative modeling in high-dimensional feature spaces still remains challenging. In this work, we discover that a new type of latent action representation, which we refer to as *Residual Latent Action* (RLA), can be easily learned from DINO residuals. We also show that RLA is predictive, generalizable, and encodes temporal progression. Building on RLA, we propose *RLA World Model* (RLA-WM), which predicts RLA values via flow matching. RLA-WM outperforms both state-of-the-art feature-based and video-diffusion world models on simulation and real-world datasets, while being orders of magnitude faster than video diffusion. Furthermore, we develop two robot learning techniques that use RLA-WM to improve policy learning. The first one is a minimalist world action model with RLA that learns from actionless demonstration videos. The second one is the first visual RL framework trained entirely inside a world model learned from offline videos only, using a video-aligned reward and no online interactions or handcrafted rewards. Project page: https://mlzxy.github.io/rla-wm

78.8 · RO · Apr 29
Learning Tactile-Aware Quadrupedal Loco-Manipulation Policies

Pokuang Zhou, Yuhao Zhou, Quan Luu et al.

Quadrupedal loco-manipulation is commonly built on visual perception and proprioception. Yet reliable contact-rich manipulation remains difficult: vision and proprioception alone cannot resolve uncertain, evolving interactions with the environment. Tactile sensing offers direct contact observability, but a scalable tactile-aware learning framework for quadrupedal loco-manipulation is still underexplored. In this paper, we present a tactile-aware loco-manipulation policy learning pipeline with a hierarchical structure. Our approach has two key components. First, we leverage real-world human demonstrations to train a tactile-conditioned visuotactile high-level policy. This policy predicts not only end-effector trajectories for manipulation, but also the evolving tactile interaction cues that characterize how contact should develop over time. Second, we perform large-scale reinforcement learning in simulation to learn a tactile-aware whole-body control policy that tracks diverse commanded trajectories and tactile interaction cues, and transfers zero-shot to the real world. Together, these components enable coordinated locomotion and manipulation under contact-rich scenarios. We evaluate the system on real-world contact-rich tasks, including in-hand reorientation with insertion, valve tightening, and delicate object manipulation. Compared to vision-only and visuotactile baselines, our method improves performance by 28.54% on average across these tasks.

87.0 · RO · Mar 13
TacVLA: Contact-Aware Tactile Fusion for Robust Vision-Language-Action Manipulation

Kaidi Zhang, Heng Zhang, Zhengtong Xu et al.

Vision-Language-Action (VLA) models have demonstrated significant advantages in robotic manipulation. However, their reliance on vision and language often leads to suboptimal performance in tasks involving visual occlusion, fine-grained manipulation, and physical contact. To address these challenges, we propose TacVLA, a VLA model fine-tuned to incorporate tactile modalities into the transformer-based policy and enhance fine-grained manipulation capabilities. Specifically, we introduce a contact-aware gating mechanism that selectively activates tactile tokens only when contact is detected, enabling adaptive multimodal fusion while avoiding irrelevant tactile interference. The fused visual, language, and tactile tokens are jointly processed within the transformer architecture to strengthen cross-modal grounding during contact-rich interaction. Extensive experiments on constraint-locked disassembly, in-box picking, and robustness evaluations demonstrate that our model outperforms baselines, improving success rate by an average of 20% in disassembly and 60% in in-box picking, with a 2.1x improvement in scenarios with visual occlusion. Videos are available at https://sites.google.com/view/tacvla and code will be released.
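The contact-aware gating idea can be sketched as an attention mask over the token sequence. This is a simplified illustration, not TacVLA's implementation: the scalar `contact_signal`, its `threshold`, and the function name are assumptions, and tokens are represented as opaque list items rather than embedding tensors.

```python
def gate_tactile_tokens(visual, language, tactile, contact_signal, threshold=0.5):
    """Build a fixed-length token sequence plus a 0/1 attention mask.

    Tactile tokens stay in the sequence so its length never changes, but
    they are masked out (0) unless the assumed contact signal exceeds the
    threshold, so they cannot influence the policy before contact.
    """
    in_contact = contact_signal > threshold
    tokens = list(visual) + list(language) + list(tactile)
    always_on = [1] * (len(visual) + len(language))
    tactile_mask = [1 if in_contact else 0] * len(tactile)
    return tokens, always_on + tactile_mask
```

Masking rather than dropping tokens is a design choice of this sketch: it keeps tensor shapes static, which is convenient for batched transformer inference.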

CV · Jul 24, 2025
Synthetic Data Augmentation for Enhanced Chicken Carcass Instance Segmentation

Yihong Feng, Chaitanya Pallerla, Xiaomin Lin et al.

The poultry industry has been driven by broiler chicken production and has grown into the world's largest animal protein sector. Automated detection of chicken carcasses on processing lines is vital for quality control, food safety, and operational efficiency in slaughterhouses and poultry processing plants. However, developing robust deep learning models for tasks like instance segmentation in these fast-paced industrial environments is often hampered by the need for laborious acquisition and annotation of large-scale real-world image datasets. We present the first pipeline generating photo-realistic, automatically labeled synthetic images of chicken carcasses. We also introduce a new benchmark dataset containing 300 annotated real-world images, curated specifically for poultry segmentation research. Using these datasets, this study investigates the efficacy of synthetic data and automatic data annotation to enhance the instance segmentation of chicken carcasses, particularly when real annotated data from the processing line is scarce. A small real dataset with varying proportions of synthetic images was evaluated in prominent instance segmentation models. Results show that synthetic data significantly boosts segmentation performance for chicken carcasses across all models. This research underscores the value of synthetic data augmentation as a viable and effective strategy to mitigate data scarcity, reduce manual annotation efforts, and advance the development of robust AI-driven automated detection systems for chicken carcasses in the poultry processing industry.

RO · Dec 13, 2021
Learning Generalizable Vision-Tactile Robotic Grasping Strategy for Deformable Objects via Transformer

Yunhai Han, Kelin Yu, Rahul Batra et al.

Reliable robotic grasping, especially with deformable objects such as fruits, remains a challenging task due to underactuated contact interactions with a gripper, unknown object dynamics and geometries. In this study, we propose a Transformer-based robotic grasping framework for rigid grippers that leverages tactile and visual information for safe object grasping. Specifically, the Transformer models learn physical feature embeddings with sensor feedback through performing two pre-defined explorative actions (pinching and sliding) and predict a grasping outcome through a multilayer perceptron (MLP) with a given grasping strength. Using these predictions, the gripper predicts a safe grasping strength via inference. Compared with convolutional recurrent networks, the Transformer models can capture the long-term dependencies across the image sequences and process spatial-temporal features simultaneously. We first benchmark the Transformer models on a public dataset for slip detection. Following that, we show that the Transformer models outperform a CNN+LSTM model in terms of grasping accuracy and computational efficiency. We also collect a new fruit grasping dataset and conduct online grasping experiments using the proposed framework for both seen and unseen fruits. In addition, we extend our model to objects with different shapes and demonstrate the effectiveness of our pre-trained model trained on our large-scale fruit dataset. Our code and dataset are public on GitHub.

RO · Jun 16, 2021
GelSight Wedge: Measuring High-Resolution 3D Contact Geometry with a Compact Robot Finger

Shaoxiong Wang, Yu She, Branden Romero et al.

Vision-based tactile sensors have the potential to provide important contact geometry to localize objects under visual occlusion. However, it is challenging to measure high-resolution 3D contact geometry for a compact robot finger while simultaneously meeting optical and mechanical constraints. In this work, we present the GelSight Wedge sensor, which is optimized to have a compact shape for robot fingers, while achieving high-resolution 3D reconstruction. We evaluate the 3D reconstruction under different lighting configurations, and extend the method from 3 lights to 1 or 2 lights. We demonstrate the flexibility of the design by shrinking the sensor to the size of a human finger for fine manipulation tasks. We also show the effectiveness and potential of the reconstructed 3D geometry for pose tracking in the 3D space.

RO · Oct 3, 2019
Cable Manipulation with a Tactile-Reactive Gripper

Yu She, Shaoxiong Wang, Siyuan Dong et al.

Cables are complex, high dimensional, and dynamic objects. Standard approaches to manipulate them often rely on conservative strategies that involve long series of very slow and incremental deformations, or various mechanical fixtures such as clamps, pins or rings. We are interested in manipulating freely moving cables, in real time, with a pair of robotic grippers, and with no added mechanical constraints. The main contribution of this paper is a perception and control framework that moves in that direction, and uses real-time tactile feedback to accomplish the task of following a dangling cable. The approach relies on a vision-based tactile sensor, GelSight, that estimates the pose of the cable in the grip, and the friction forces during cable sliding. We achieve the behavior by combining two tactile-based controllers: 1) Cable grip controller, where a PD controller combined with a leaky integrator regulates the gripping force to maintain the frictional sliding forces close to a suitable value; and 2) Cable pose controller, where an LQR controller based on a learned linear model of the cable sliding dynamics keeps the cable centered and aligned on the fingertips to prevent the cable from falling from the grip. This behavior is possible by a reactive gripper fitted with GelSight-based high-resolution tactile sensors. The robot can follow one meter of cable in random configurations within 2-3 hand regrasps, adapting to cables of different materials and thicknesses. We demonstrate a robot grasping a headphone cable, sliding the fingers to the jack connector, and inserting it. To the best of our knowledge, this is the first implementation of real-time cable following without the aid of mechanical fixtures.
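The cable grip controller described in this abstract combines PD terms with a leaky integrator. A minimal discrete-time sketch of that combination is shown below; the class name, gain names, and the exact way the terms are summed are assumptions, since the abstract gives only the controller type and its goal of holding the sliding friction force near a target value.

```python
class LeakyPDGripController:
    """PD controller plus a leaky integrator on the friction-force error (illustrative)."""

    def __init__(self, kp, kd, ki, leak, dt):
        self.kp, self.kd, self.ki = kp, kd, ki
        self.leak = leak          # in [0, 1): fraction of the integral kept each step
        self.dt = dt              # control period in seconds
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, friction_target, friction_measured):
        """Return a grip-force command from the current friction-force error."""
        error = friction_target - friction_measured
        derivative = (error - self.prev_error) / self.dt
        # Leaky integration: old error decays geometrically instead of
        # accumulating forever, which limits wind-up during long slides.
        self.integral = self.leak * self.integral + error * self.dt
        self.prev_error = error
        return self.kp * error + self.kd * derivative + self.ki * self.integral
```

With `leak` set to 1.0 this reduces to an ordinary PID controller; values below 1.0 make the integral term forget old error, so a transient snag on the cable does not leave a lasting bias in grip force.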

RO · Oct 3, 2019
Exoskeleton-covered soft finger with vision-based proprioception and tactile sensing

Yu She, Sandra Q. Liu, Peiyu Yu et al.

Soft robots offer significant advantages in adaptability, safety, and dexterity compared to conventional rigid-body robots. However, it is challenging to equip soft robots with accurate proprioception and tactile sensing due to their high flexibility and elasticity. In this work, we describe the development of a vision-based proprioceptive and tactile sensor for soft robots called GelFlex, which is inspired by previous GelSight sensing techniques. More specifically, we develop a novel exoskeleton-covered soft finger with embedded cameras and deep learning methods that enable high-resolution proprioceptive sensing and rich tactile sensing. To do so, we design features along the axial direction of the finger, which enable high-resolution proprioceptive sensing, and incorporate a reflective ink coating on the surface of the finger to enable rich tactile sensing. We design a highly underactuated exoskeleton with a tendon-driven mechanism to actuate the finger. Finally, we assemble 2 of the fingers together to form a robotic gripper and successfully perform a bar stock classification task, which requires both shape and tactile information. We train neural networks for proprioception and shape (box versus cylinder) classification using data from the embedded sensors. The proprioception CNN had over 99% accuracy on our testing set (all six joint angles were within 1 degree of error) and had an average accumulative distance error of 0.77 mm during live testing, which is better than human finger proprioception. These proposed techniques offer soft robots the high-level ability to simultaneously perceive their proprioceptive state and peripheral environment, providing potential solutions for soft robots to solve everyday manipulation tasks. We believe the methods developed in this work can be widely applied to different designs and applications.