GraspMamba: A Mamba-based Language-driven Grasp Detection Framework with Hierarchical Feature Learning
This work addresses language-driven grasp detection for robotics, reporting improved performance and inference speed, though it appears incremental because it adapts existing Mamba architectures to a specific domain.
The paper targets two weaknesses of current language-driven grasp detection models: difficulty with cluttered scenes and slow inference. It introduces GraspMamba, a Mamba-based framework with hierarchical feature learning, which the authors report outperforms recent methods and runs fast in real-world robotic experiments.
Grasp detection is a fundamental robotic task critical to the success of many industrial applications. However, current language-driven models for this task often struggle with cluttered images, lengthy textual descriptions, or slow inference speed. We introduce GraspMamba, a new language-driven grasp detection method that employs hierarchical feature fusion with Mamba vision to tackle these challenges. By leveraging rich visual features of the Mamba-based backbone alongside textual information, our approach effectively enhances the fusion of multimodal features. GraspMamba represents the first Mamba-based grasp detection model to extract vision and language features at multiple scales, delivering robust performance and rapid inference time. Intensive experiments show that GraspMamba outperforms recent methods by a clear margin. We validate our approach through real-world robotic experiments, highlighting its fast inference speed.
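To make the idea of fusing vision and language features at multiple scales concrete, here is a minimal NumPy sketch. It is not the GraspMamba architecture: the function name `fuse_multiscale`, the sigmoid gating rule, the random projection matrices, and all tensor shapes are assumptions chosen for illustration. The abstract does not specify how the fusion is actually performed.

```python
import numpy as np

def fuse_multiscale(visual_feats, text_emb, projections):
    """Fuse one text embedding into visual feature maps at every scale.

    visual_feats: list of arrays shaped (C_i, H_i, W_i), one per scale
    text_emb:     array shaped (D,)
    projections:  list of arrays shaped (C_i, D); hypothetical stand-ins
                  for learned per-scale projection weights
    """
    fused = []
    for feat, proj in zip(visual_feats, projections):
        # Project the text embedding to this scale's channel count.
        text_c = proj @ text_emb                  # shape (C_i,)
        # Sigmoid gate in (0, 1), one value per channel (an assumed fusion rule).
        gate = 1.0 / (1.0 + np.exp(-text_c))
        # Broadcast the gate over the spatial dimensions.
        fused.append(feat * gate[:, None, None])
    return fused

# Toy inputs: three scales of a feature pyramid and one text embedding.
rng = np.random.default_rng(0)
shapes = [(8, 32, 32), (16, 16, 16), (32, 8, 8)]
text_dim = 12
visual_feats = [rng.standard_normal(s) for s in shapes]
text_emb = rng.standard_normal(text_dim)
projections = [rng.standard_normal((c, text_dim)) for c, _, _ in shapes]

fused = fuse_multiscale(visual_feats, text_emb, projections)
print([f.shape for f in fused])
```

Each scale keeps its own spatial resolution and channel count, so the text signal conditions both coarse and fine visual features. A real model would learn the projections and would likely use a richer fusion operator than per-channel gating.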