RO · Mar 16

Real-time Capable Learning-based Visual Tool Pose Correction via Differentiable Simulation

arXiv:2505.08875 · 54.82 citations · h-index: 1
AI Analysis

This work addresses poor proprioception in surgical robots, which could enable more efficient autonomous control. The contribution is incremental, since it builds on existing vision-based methods.

The paper tackles inaccurate tool pose estimation in robot-assisted surgery with a real-time capable, Vision Transformer-based method trained through differentiable simulation. The method reduced hand-eye translation errors by more than 50% and ran inference at 22 Hz.
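As a rough check on what 22 Hz means per frame, the reported rate can be converted to a time budget. The second line is an assumption, not a figure from the source: it reads the abstract's "four times faster" as a ratio of per-frame runtimes against the optimization-based baseline, whose actual rate is not stated here.

```latex
\[
T_{\text{frame}} = \frac{1}{f} = \frac{1}{22\ \text{Hz}} \approx 45.5\ \text{ms}
\]
\[
T_{\text{baseline}} \approx 4 \times 45.5\ \text{ms} \approx 182\ \text{ms}
\quad (\approx 5.5\ \text{Hz, assumed interpretation})
\]
```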

Autonomy in robot-assisted minimally invasive surgery has the potential to reduce surgeon cognitive and task load, thereby increasing procedural efficiency. However, accurate autonomous control is difficult to implement because end-effector proprioception is poor: joint encoder readings are typically inaccurate due to kinematic non-idealities in the cable-driven transmissions. Vision-based pose estimation approaches are highly effective, but they lack real-time capability or generalizability, or can be hard to train. In this work, we demonstrate a real-time capable, Vision Transformer-based pose estimation approach that is trained using end-to-end differentiable kinematics and rendering. Using a real robot dataset, we demonstrate the potential of this approach to correct noisy pose estimates and its potential for real-time processing. Our approach removes more than 50% of the hand-eye translation error in the dataset, reaching the same performance level as an existing optimization-based method. Our approach is four times faster and capable of near real-time inference at 22 Hz. A zero-shot prediction on an unseen dataset shows good generalization ability, and the model can be further finetuned for increased performance without human labeling.
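The core idea of correcting noisy joint readings by pushing an observation error back through differentiable kinematics can be illustrated with a toy example. The sketch below is not the paper's method: it replaces the Vision Transformer and the differentiable renderer with a planar two-link arm whose tool-tip position is assumed to be observed directly, and it uses a hand-written Jacobian instead of automatic differentiation. All names (`forward_kinematics`, `estimate_joint_offsets`, the link lengths, the offsets, and the poses) are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical link lengths of a planar two-link arm (arbitrary units).
L1, L2 = 1.0, 0.8

def forward_kinematics(q1, q2):
    """Tool-tip position of the planar arm for joint angles q1, q2 (radians)."""
    x = L1 * math.cos(q1) + L2 * math.cos(q1 + q2)
    y = L1 * math.sin(q1) + L2 * math.sin(q1 + q2)
    return x, y

def estimate_joint_offsets(measured, observed, lr=0.1, steps=2000):
    """Gradient descent on constant joint offsets (d1, d2).

    Minimizes the mean squared distance between the tip position predicted
    from the offset-corrected encoder readings and the observed tip position.
    The gradient is 2 * J^T * residual, with J the analytic FK Jacobian.
    """
    d1, d2 = 0.0, 0.0
    n = len(measured)
    for _ in range(steps):
        g1, g2 = 0.0, 0.0
        for (q1, q2), (ox, oy) in zip(measured, observed):
            a, b = q1 + d1, q2 + d2
            px, py = forward_kinematics(a, b)
            rx, ry = px - ox, py - oy  # residual in task space
            # Jacobian entries of the tip position w.r.t. the two joints.
            j00 = -L1 * math.sin(a) - L2 * math.sin(a + b)
            j10 = L1 * math.cos(a) + L2 * math.cos(a + b)
            j01 = -L2 * math.sin(a + b)
            j11 = L2 * math.cos(a + b)
            g1 += 2.0 * (j00 * rx + j10 * ry) / n
            g2 += 2.0 * (j01 * rx + j11 * ry) / n
        d1 -= lr * g1
        d2 -= lr * g2
    return d1, d2

# Synthetic data: encoder readings with an unknown constant offset (assumed values).
TRUE_OFFSET = (0.05, -0.08)
MEASURED = [(0.3, 0.9), (0.8, 1.2), (-0.4, 0.7), (1.1, 1.4)]
OBSERVED = [forward_kinematics(q1 + TRUE_OFFSET[0], q2 + TRUE_OFFSET[1])
            for q1, q2 in MEASURED]

if __name__ == "__main__":
    print(estimate_joint_offsets(MEASURED, OBSERVED))
```

In this toy setting the optimizer runs per dataset, which is closer in spirit to an optimization-based baseline. The approach described in the abstract instead trains a network through the differentiable pipeline so that correction at run time is a single forward pass.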

