86.9ROApr 23Code
FingerViP: Learning Real-World Dexterous Manipulation with Fingertip Visual PerceptionZhen Zhang, Weinan Wang, Hejia Sun et al.
The current practice of dexterous manipulation generally relies on a single wrist-mounted view, which is often occluded and limits performance on tasks requiring multi-view perception. In this work, we present FingerViP, a learning system that utilizes a visuomotor policy with fingertip visual perception for dexterous manipulation. Specifically, we design a vision-enhanced fingertip module with an embedded miniature camera and install the modules on each finger of a multi-fingered hand. The fingertip cameras substantially improve visual perception by providing comprehensive, multi-view feedback of both the hand and its surrounding environment. Building on the integrated fingertip modules, we develop a diffusion-based whole-body visuomotor policy conditioned on a third-view camera and multi-view fingertip vision, which effectively learns complex manipulation skills directly from human demonstrations. To improve view-proprioception alignment and contact awareness, each fingertip visual feature is augmented with its corresponding camera pose encoding and per-finger joint-current encoding. We validate the effectiveness of the multi-view fingertip vision and demonstrate the robustness and adaptability of FingerViP on various challenging real-world tasks, including pressing buttons inside a confined box, retrieving sticks from an unstable support, retrieving objects behind an occluding curtain, and performing long-horizon cabinet opening and object retrieval, achieving an overall success rate of 80.8%. All hardware designs and code will be fully open-sourced.
99.6ROApr 22
Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical RoboticsOpen-H-Embodiment Consortium, Nigel Nelson, Juo-Tung Chen et al.
Autonomous medical robots hold promise to improve patient outcomes, reduce provider workload, democratize access to care, and enable superhuman precision. However, autonomous medical robotics has been limited by a fundamental data problem: existing medical robotic datasets are small, single-embodiment, and rarely shared openly, restricting the development of foundation models that the field needs to advance. We introduce Open-H-Embodiment, the largest open dataset of medical robotic video with synchronized kinematics to date, spanning more than 49 institutions and multiple robotic platforms including the CMR Versius, Intuitive Surgical's da Vinci, da Vinci Research Kit (dVRK), Rob Surgical BiTrack, Virtual Incision's MIRA, Moon Surgical Maestro, and a variety of custom systems, spanning surgical manipulation, robotic ultrasound, and endoscopy procedures. We demonstrate the research enabled by this dataset through two foundation models. GR00T-H is the first open foundation vision-language-action model for medical robotics, which is the only evaluated model to achieve full end-to-end task completion on a structured suturing benchmark (25% of trials vs. 0% for all others) and achieves 64% average success across a 29-step ex vivo suturing sequence. We also train Cosmos-H-Surgical-Simulator, the first action-conditioned world model to enable multi-embodiment surgical simulation from a single checkpoint, spanning nine robotic platforms and supporting in silico policy evaluation and synthetic data generation for the medical domain. These results suggest that open, large-scale medical robot data collection can serve as critical infrastructure for the research community, enabling advances in robot learning, world modeling, and beyond.
ROOct 13, 2023
Interactive Navigation in Environments with Traversable Obstacles Using Large Language and Vision-Language ModelsZhen Zhang, Anran Lin, Chun Wai Wong et al.
This paper proposes an interactive navigation framework by using large language and vision-language models, allowing robots to navigate in environments with traversable obstacles. We utilize the large language model (GPT-3.5) and the open-set Vision-language Model (Grounding DINO) to create an action-aware costmap to perform effective path planning without fine-tuning. With the large models, we can achieve an end-to-end system from textual instructions like "Can you pass through the curtains to deliver medicines to me?", to bounding boxes (e.g., curtains) with action-aware attributes. They can be used to segment LiDAR point clouds into two parts: traversable and untraversable parts, and then an action-aware costmap is constructed for generating a feasible path. The pre-trained large models have great generalization ability and do not require additional annotated data for training, allowing fast deployment in the interactive navigation tasks. We choose to use multiple traversable objects such as curtains and grasses for verification by instructing the robot to traverse them. Besides, traversing curtains in a medical scenario was tested. All experimental results demonstrated the proposed framework's effectiveness and adaptability to diverse environments.
IVMar 21, 2024Code
VibNet: Vibration-Boosted Needle Detection in Ultrasound ImagesDianye Huang, Chenyang Li, Angelos Karlas et al.
Precise percutaneous needle detection is crucial for ultrasound (US)-guided interventions. However, inherent limitations such as speckles, needle-like artifacts, and low resolution make it challenging to robustly detect needles, especially when their visibility is reduced or imperceptible. To address this challenge, we propose VibNet, a learning-based framework designed to enhance the robustness and accuracy of needle detection in US images by leveraging periodic vibration applied externally to the needle shafts. VibNet integrates neural Short-Time Fourier Transform and Hough Transform modules to achieve successive sub-goals, including motion feature extraction in the spatiotemporal space, frequency feature aggregation, and needle detection in the Hough space. Due to the periodic subtle vibration, the features are more robust in the frequency domain than in the image intensity domain, making VibNet more effective than traditional intensity-based methods. To demonstrate the effectiveness of VibNet, we conducted experiments on distinct \textit{ex vivo} porcine and bovine tissue samples. The results obtained on porcine samples demonstrate that VibNet effectively detects needles even when their visibility is severely reduced, with a tip error of $1.61\pm1.56~mm$ compared to $8.15\pm9.98~mm$ for UNet and $6.63\pm7.58~mm$ for WNet, and a needle direction error of $1.64\pm1.86^{\circ}$ compared to $9.29\pm15.30^{\circ}$ for UNet and $8.54\pm17.92^{\circ}$ for WNet. Code: https://github.com/marslicy/VibNet.
ROMay 22, 2025
Manipulating Elasto-Plastic Objects With 3D Occupancy and Learning-Based Predictive ControlZhen Zhang, Xiangyu Chu, Yunxi Tang et al.
Manipulating elasto-plastic objects remains a significant challenge due to severe self-occlusion, difficulties of representation, and complicated dynamics. This work proposes a novel framework for elasto-plastic object manipulation with a quasi-static assumption for motions, leveraging 3D occupancy to represent such objects, a learned dynamics model trained with 3D occupancy, and a learning-based predictive control algorithm to address these challenges effectively. We build a novel data collection platform to collect full spatial information and propose a pipeline for generating a 3D occupancy dataset. To infer the 3D occupancy during manipulation, an occupancy prediction network is trained with multiple RGB images supervised by the generated dataset. We design a deep neural network empowered by a 3D convolution neural network (CNN) and a graph neural network (GNN) to predict the complex deformation with the inferred 3D occupancy results. A learning-based predictive control algorithm is introduced to plan the robot actions, incorporating a novel shape-based action initialization module specifically designed to improve the planner efficiency. The proposed framework in this paper can successfully shape the elasto-plastic objects into a given goal shape and has been verified in various experiments both in simulation and the real world.
CVOct 29, 2024
DOFS: A Real-world 3D Deformable Object Dataset with Full Spatial Information for Dynamics Model LearningZhen Zhang, Xiangyu Chu, Yunxi Tang et al.
This work proposes DOFS, a pilot dataset of 3D deformable objects (DOs) (e.g., elasto-plastic objects) with full spatial information (i.e., top, side, and bottom information) using a novel and low-cost data collection platform with a transparent operating plane. The dataset consists of active manipulation action, multi-view RGB-D images, well-registered point clouds, 3D deformed mesh, and 3D occupancy with semantics, using a pinching strategy with a two-parallel-finger gripper. In addition, we trained a neural network with the down-sampled 3D occupancy and action as input to model the dynamics of an elasto-plastic object. Our dataset and all CADs of the data collection system will be released soon on our website.
ROAug 9, 2025
Vibration-Based Energy Metric for Restoring Needle Alignment in Autonomous Robotic UltrasoundZhongyu Chen, Chenyang Li, Xuesong Li et al.
Precise needle alignment is essential for percutaneous needle insertion in robotic ultrasound-guided procedures. However, inherent challenges such as speckle noise, needle-like artifacts, and low image resolution make robust needle detection difficult, particularly when visibility is reduced or lost. In this paper, we propose a method to restore needle alignment when the ultrasound imaging plane and the needle insertion plane are misaligned. Unlike many existing approaches that rely heavily on needle visibility in ultrasound images, our method uses a more robust feature by periodically vibrating the needle using a mechanical system. Specifically, we propose a vibration-based energy metric that remains effective even when the needle is fully out of plane. Using this metric, we develop a control strategy to reposition the ultrasound probe in response to misalignments between the imaging plane and the needle insertion plane in both translation and rotation. Experiments conducted on ex-vivo porcine tissue samples using a dual-arm robotic ultrasound-guided needle insertion system demonstrate the effectiveness of the proposed approach. The experimental results show the translational error of 0.41$\pm$0.27 mm and the rotational error of 0.51$\pm$0.19 degrees.
ROSep 15, 2021
Inequality Constrained Trajectory Optimization with A Hybrid Multiple-shooting iLQRYunxi Tang, Xiangyu Chu, Wanxin Jin et al.
Trajectory optimization has been used extensively in robotic systems. In particular, iterative Linear Quadratic Regulator (iLQR) has performed well as an off-line planner and online nonlinear model predictive control solver, with a lower computational cost. However, standard iLQR cannot handle any constraints or perform reasonable initialization of a state trajectory. In this paper, we propose a hybrid constrained iLQR variant with a multiple-shooting framework to incorporate general inequality constraints and infeasible states initialization. The main technical contributions are twofold: 1) In addition to inheriting the simplicity of the initialization in multiple-shooting settings, a two-stage framework is developed to deal with state and/or control constraints robustly without loss of the linear feedback term of iLQR. Such a hybrid strategy offers fast convergence of constraint satisfaction. 2) An improved globalization strategy is proposed to exploit the coupled effects between line-searching and regularization, which is able to enhance the numerical robustness of the constrained iLQR approaches. Our approach is tested on various constrained trajectory optimization problems and outperforms the commonly-used collocation and shooting methods.