GraspSplats: Efficient Manipulation with 3D Feature Splatting
The work addresses the need for practical robot manipulation with efficient part localization, though it appears incremental in that it improves on existing methods such as NeRFs and point-based approaches.
The paper tackles efficient, zero-shot grasping of object parts by bridging the 2D-to-3D gap in scene representations. It proposes GraspSplats, which generates high-quality scene representations in under 60 seconds and significantly outperforms existing methods in diverse task settings.
The ability of robots to perform efficient, zero-shot grasping of object parts is crucial for practical applications and is becoming prevalent with recent advances in Vision-Language Models (VLMs). To bridge the 2D-to-3D gap in representations that support this capability, existing methods rely on neural radiance fields (NeRFs) via differentiable rendering or on point-based projection methods. However, we demonstrate that NeRFs are ill-suited to scene changes because they are implicit, and that point-based methods are inaccurate for part localization without rendering-based optimization. To address these issues, we propose GraspSplats. Using depth supervision and a novel reference feature computation method, GraspSplats generates high-quality scene representations in under 60 seconds. We further validate the advantages of a Gaussian-based representation by showing that the explicit and optimized geometry in GraspSplats is sufficient to natively support (1) real-time grasp sampling and (2) dynamic and articulated object manipulation with point trackers. With extensive experiments on a Franka robot, we demonstrate that GraspSplats significantly outperforms existing methods under diverse task settings. In particular, GraspSplats outperforms NeRF-based methods such as F3RM and LERF-TOGO, as well as 2D detection methods.
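The contrast between implicit NeRFs and explicit Gaussians can be illustrated with a small sketch. This is not the GraspSplats implementation: the function names `rotate_z` and `move_part`, the toy centers, and the part mask are all hypothetical, chosen only to show why an explicit representation can follow scene changes. When each primitive stores its own 3D center, moving an object part amounts to applying a rigid transform to the selected centers, with no re-optimization of a network. A full Gaussian edit would presumably also rotate each covariance, which is omitted here.

```python
import math

def rotate_z(point, angle_rad):
    """Rotate a 3D point about the z axis by angle_rad."""
    x, y, z = point
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return (c * x - s * y, s * x + c * y, z)

def move_part(centers, part_mask, angle_rad, translation):
    """Apply a rigid transform (z rotation, then translation) to the
    centers flagged in part_mask; all other centers are left untouched.

    Hypothetical helper: only primitive centers are transformed here.
    """
    tx, ty, tz = translation
    moved = []
    for center, selected in zip(centers, part_mask):
        if selected:
            x, y, z = rotate_z(center, angle_rad)
            center = (x + tx, y + ty, z + tz)
        moved.append(center)
    return moved

# Toy scene: three primitive centers, of which only the first belongs
# to the part being displaced (for example, by a tracked motion).
centers = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
part_mask = [True, False, False]
updated = move_part(centers, part_mask, math.pi / 2, (0.0, 0.0, 0.5))
```

In this sketch the displaced center ends up near (0, 1, 0.5) while the other two are unchanged. An implicit field has no per-primitive state to edit in this way, which is the limitation the abstract attributes to NeRFs.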