Robot Manipulation in Salient Vision through Referring Image Segmentation and Geometric Constraints
This work addresses robot manipulation for real-world applications by reducing reliance on labor-intensive feature annotations, though it appears incremental because it builds on existing segmentation and visual servoing techniques.
The paper tackles robot manipulation in real-world environments by integrating a lightweight referring image segmentation model (CLIPU$^2$Net) with geometric constraints, enabling robot control from language expressions. On 46 real-world manipulation tasks it outperforms traditional visual servoing methods that rely on hand-annotated features, and it achieves fine-grain segmentation with a compact decoder of 6.6 MB.
In this paper, we perform robot manipulation activities in real-world environments with language contexts by integrating a compact referring image segmentation model into the robot's perception module. First, we propose CLIPU$^2$Net, a lightweight referring image segmentation model designed for fine-grain boundary and structure segmentation from language expressions. Then, we deploy the model in an eye-in-hand visual servoing system to enact robot control in the real world. The key to our system is the representation of salient visual information as geometric constraints, linking the robot's visual perception to actionable commands. Experimental results on 46 real-world robot manipulation tasks demonstrate that our method outperforms traditional visual servoing methods relying on labor-intensive feature annotations, excels in fine-grain referring image segmentation with a compact decoder size of 6.6 MB, and supports robot control across diverse contexts.
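To make the idea of "salient visual information as geometric constraints" concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a segmentation mask (here a synthetic array standing in for CLIPU$^2$Net output) is reduced to an image point, that the task is expressed as a point-to-point or point-to-line error, and that a generic proportional law with a pseudo-inverse of a given image Jacobian drives the error to zero. All function names, the choice of constraint types, the gain, and the identity Jacobian are illustrative assumptions; the abstract does not specify the controller.

```python
import numpy as np

def mask_centroid(mask):
    """Return the (u, v) pixel centroid of a non-empty binary mask."""
    vs, us = np.nonzero(mask)          # row indices are v, column indices are u
    if us.size == 0:
        raise ValueError("mask is empty")
    return np.array([us.mean(), vs.mean()])

def point_to_point_error(p, q):
    """Point-to-point constraint: 2D offset of point p from target point q."""
    return np.asarray(p, dtype=float) - np.asarray(q, dtype=float)

def point_to_line_error(p, a, b):
    """Point-to-line constraint: signed pixel distance of p from the line through a and b."""
    line = np.cross(np.append(a, 1.0), np.append(b, 1.0))  # homogeneous line a x b
    return float(np.dot(line, np.append(p, 1.0)) / np.linalg.norm(line[:2]))

def servo_step(error, jacobian, gain=0.5):
    """Generic proportional visual servoing step (assumed law): v = -gain * pinv(J) @ e."""
    return -gain * np.linalg.pinv(jacobian) @ np.atleast_1d(error)

# Synthetic mask standing in for a segmentation result (hypothetical data).
mask = np.zeros((5, 5), dtype=bool)
mask[1:4, 2:5] = True

tool_point = mask_centroid(mask)                       # salient point from the mask
error = point_to_point_error(tool_point, [1.0, 2.0])   # target pixel is an assumption
command = servo_step(error, np.eye(2))                 # identity Jacobian for illustration
```

In a real eye-in-hand setup the Jacobian would come from camera calibration or online estimation, and the command would be a camera or end-effector velocity rather than a 2D vector; the sketch only shows how a mask can be turned into an error signal that a controller can minimize.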