RO · Mar 16

Real-time Capable Learning-based Visual Tool Pose Correction via Differentiable Simulation

arXiv:2505.08875 · 54.82 citations · h-index: 1

AI Analysis

This work addresses the challenge of poor proprioception in surgical robots, enabling more efficient autonomous control, though it is incremental in that it builds on existing vision-based methods.

The paper tackles inaccurate pose estimation in robot-assisted surgery by developing a real-time Vision Transformer-based method trained with differentiable simulation, which reduces hand-eye translation errors by over 50% and achieves inference at 22 Hz.

Autonomy in robot-assisted minimally invasive surgery has the potential to reduce surgeon cognitive and task load, thereby increasing procedural efficiency. However, implementing accurate autonomous control can be difficult due to poor end-effector proprioception: joint encoder readings are typically inaccurate due to kinematic non-idealities in cable-driven transmissions. Vision-based pose estimation approaches are highly effective, but they lack real-time capability or generalizability, or can be hard to train. In this work, we demonstrate a real-time capable, Vision Transformer-based pose estimation approach that is trained using end-to-end differentiable kinematics and rendering. Using a real robot dataset, we demonstrate the potential of this approach to correct noisy pose estimates and its potential for real-time processing. Our approach reduces hand-eye translation errors in the dataset by more than 50%, reaching the same performance level as an existing optimization-based method while running four times faster, and is capable of near real-time inference at 22 Hz. Zero-shot prediction on an unseen dataset shows good generalization ability, and the model can be further finetuned for increased performance without human labeling.
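To make the two headline figures concrete, the following is a minimal sketch of how a relative reduction in hand-eye translation error and the per-frame time budget at 22 Hz could be computed. The abstract does not state the exact error metric, so Euclidean distance between translation vectors is an assumption here, and every function name and number below is hypothetical and chosen only for illustration.

```python
import math

def translation_error(t_est, t_ref):
    """Euclidean distance between two 3-D translation vectors (same units).
    Assumed metric; the abstract does not specify one."""
    return math.dist(t_est, t_ref)

def relative_reduction(err_before, err_after):
    """Fraction of the initial error removed by the correction."""
    return (err_before - err_after) / err_before

def frame_budget_ms(rate_hz):
    """Time available per frame, in milliseconds, at a given inference rate."""
    return 1000.0 / rate_hz

# Hypothetical values for illustration only (not taken from the paper).
t_ref = (0.0, 0.0, 100.0)        # reference hand-eye translation, mm
t_noisy = (3.0, 4.0, 100.0)      # estimate derived from joint encoders
t_corrected = (1.2, 1.6, 100.0)  # estimate after visual correction

err_before = translation_error(t_noisy, t_ref)
err_after = translation_error(t_corrected, t_ref)
reduction = relative_reduction(err_before, err_after)
budget = frame_budget_ms(22.0)

print(f"error before: {err_before:.2f} mm, after: {err_after:.2f} mm")
print(f"relative reduction: {reduction:.0%}")
print(f"per-frame budget at 22 Hz: {budget:.1f} ms")
```

With these made-up inputs the correction removes 60% of the translation error, which would count as "more than 50%" in the abstract's sense. At 22 Hz, each inference has roughly 45 ms available.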
