Xianyi Cheng

RO
h-index: 6
6 papers
1,507 citations
Novelty: 48%
AI Score: 51

6 Papers

AI · Jul 25, 2023
WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou, Frank F. Xu, Hao Zhu et al. · CMU

With advances in generative AI, there is now potential for autonomous agents to manage daily tasks via natural language commands. However, current agents are primarily created and tested in simplified synthetic environments, leading to a disconnect with real-world scenarios. In this paper, we build an environment for language-guided agents that is highly realistic and reproducible. Specifically, we focus on agents that perform tasks on the web, and create an environment with fully functional websites from four common domains: e-commerce, social forum discussions, collaborative software development, and content management. Our environment is enriched with tools (e.g., a map) and external knowledge bases (e.g., user manuals) to encourage human-like task-solving. Building upon our environment, we release a set of benchmark tasks focusing on evaluating the functional correctness of task completions. The tasks in our benchmark are diverse, long-horizon, and designed to emulate tasks that humans routinely perform on the internet. We experiment with several baseline agents, integrating recent techniques such as reasoning before acting. The results demonstrate that solving complex tasks is challenging: our best GPT-4-based agent only achieves an end-to-end task success rate of 14.41%, significantly lower than the human performance of 78.24%. These results highlight the need for further development of robust agents, that current state-of-the-art large language models are far from perfect performance in these real-life tasks, and that WebArena can be used to measure such progress.

32.3 · RO · May 28
Any-ttach: Quick End-effector Swapping Enables Manipulation Dexterity with Simplicity

Weizhe Ni, Jinzhou Li, Haoyu Li et al.

Robotic manipulation dexterity is often pursued by building increasingly complex high-DoF multifingered hands. While many robotic hands are designed to replicate human morphology, the functional role of human hands suggests a different perspective: much of their complexity may exist to enable tool use and tool making. This observation motivates Any-ttach, a tool-centric manipulation framework that treats quick end-effector swapping as a mechanism for dexterity with simplicity. Any-ttach combines a low-cost automatic swapping mechanism for an open-close robot interface, a handheld device for collecting human demonstrations, and a task planning framework that composes learned, parameterized, and planned tool-use skills. The system supports diverse tools and end-effector modules, including daily tools, articulated tools such as scissors, Fin Ray fingers, and a low-cost anthropomorphic hand, through the same shared interface. Our experiments show that Any-ttach improves tool-swapping reliability, increases demonstration efficiency, reduces tool-pose variability, and supports diverse tool-use skills. In two long-horizon tasks, making a sandwich and preparing a cucumber, Any-ttach executes six tool-use subskills through end-effector switching and execution monitoring. These results suggest that robots can expand manipulation capability not only through more complex end-effectors, but also through rapidly exchangeable tools and end-effector modules. More details and videos are available at https://any-ttach.github.io/.

79.7 · RO · May 19
CEER: Compliant End-Effector and Root Control as a Unified Interface for Hierarchical Humanoid Loco-Manipulation

Xinyuan Luo, Xingrui Chen, Xunjian Yin et al.

Humanoid robots have achieved impressive locomotion performance, yet contact-rich and long-horizon manipulation remains a major bottleneck. Manipulation is inherently contact-rich and demands compliant whole-body control for stable interaction, while its diversity and long-horizon nature favor modular, planner-compatible interfaces over joint-space tracking. We propose CEER, a compliant end-effector-root (EE-root) control abstraction for modular humanoid loco-manipulation within a hierarchical planning framework. CEER enables compliance-aware whole-body control in an interpretable task space defined by root motion commands and end-effector pose targets, and supports plug-and-play integration with heterogeneous high-level planners. A teacher-student framework is adopted to distill a general motion-tracking controller into a low-level policy that consumes only EE-root commands. We further construct a hierarchical system that integrates heterogeneous planners and task modules through the EE-root interface, enabling diverse manipulation tasks without retraining the underlying whole-body policy. Experiments in simulation and on hardware demonstrate 3.3 cm end-effector tracking accuracy with substantially reduced jerk compared to baselines, stable contact-rich manipulation under teleoperation, and up to 70% success in simulated single-object loco-manipulation tasks within a room-scale environment. These results indicate that compliant EE-root control provides a practical abstraction for humanoid loco-manipulation, enabling modular and scalable integration of diverse skills.

CV · Dec 18, 2025
OPENTOUCH: Bringing Full-Hand Touch to Real-World Interaction

Yuxin Ray Song, Jinzhou Li, Rao Fu et al.

The human hand is our primary interface to the physical world, yet egocentric perception rarely knows when, where, or how forcefully it makes contact. Robust wearable tactile sensors are scarce, and no existing in-the-wild datasets align first-person video with full-hand touch. To bridge the gap between visual perception and physical interaction, we present OpenTouch, the first in-the-wild egocentric full-hand tactile dataset, containing 5.1 hours of synchronized video-touch-pose data and 2,900 curated clips with detailed text annotations. Using OpenTouch, we introduce retrieval and classification benchmarks that probe how touch grounds perception and action. We show that tactile signals provide a compact yet powerful cue for grasp understanding, strengthen cross-modal alignment, and can be reliably retrieved from in-the-wild video queries. By releasing this annotated vision-touch-pose dataset and benchmark, we aim to advance multimodal egocentric perception, embodied learning, and contact-rich robotic manipulation.

RO · May 30, 2021
Contact Mode Guided Motion Planning for Quasidynamic Dexterous Manipulation in 3D

Xianyi Cheng, Eric Huang, Yifan Hou et al.

This paper presents Contact Mode Guided Manipulation Planning (CMGMP) for 3D quasistatic and quasidynamic rigid body motion planning in dexterous manipulation. The CMGMP algorithm generates hybrid motion plans including both continuous state transitions and discrete contact mode switches, without the need for pre-specified contact sequences or pre-designed motion primitives. The key idea is to use automatically enumerated contact modes of environment-object contacts to guide the tree expansions during the search. Contact modes automatically synthesize manipulation primitives, while the sampling-based planning framework sequences those primitives into a coherent plan. We test our algorithm on fourteen 3D manipulation tasks, and validate our models by executing some plans open-loop on a real robot-manipulator system.

RO · Nov 3, 2020
Contact Mode Guided Sampling-Based Planning for Quasistatic Dexterous Manipulation in 2D

Xianyi Cheng, Eric Huang, Yifan Hou et al.

The discontinuities and multi-modality introduced by contacts make manipulation planning challenging. Many previous works avoid this problem by pre-designing a set of high-level motion primitives like grasping and pushing. However, such motion primitives are often not adequate to describe dexterous manipulation motions. In this work, we propose a method for dexterous manipulation planning at a more primitive level. The key idea is to use contact modes to guide the search in a sampling-based planning framework. Our method can automatically generate contact transitions and motion trajectories under the quasistatic assumption. In the experiments, this method sometimes generates motions that are often pre-designed as motion primitives, as well as dexterous motions that are more task-specific.