RO · AI · CV · SY · Jun 11

Bounding Boxes as Goals: Language-Conditioned Grasping via Neuro-Symbolic Planning

arXiv:2606.12910v1 · 6.8
Predicted impact: top 71% in RO · last 90 days
Originality: Incremental advance
AI Analysis

This work addresses the need for efficient, zero-shot language-conditioned grasping in robotics, offering a lightweight alternative to computationally heavy approaches.

GRASP enables robots to interpret natural-language prompts for tabletop manipulation without task-specific training, achieving 73.3% overall success across 90 real-robot trials at three difficulty levels.

For robotics to be effectively integrated into household or industrial environments, machines must adapt to natural-language prompts in real time. Although Vision-Language Models (VLMs) have enabled zero-shot generalization in robot task and motion planning (TAMP), current state-of-the-art approaches often remain computationally "heavyweight" or require extensive training on thousands of demonstrations. We present GRASP (Grounded Reasoning and Symbolic Planning), a framework designed as a step toward open-vocabulary tabletop manipulation. Our approach leverages a pretrained VLM to translate natural-language queries into neuro-symbolic goal states, grounded in the physical world via a bounding-box detection pipeline. Unlike methods that rely on fixed color lists or hard-coded coordinates, GRASP enables robots to interpret abstract spatial concepts such as "top shelf" and execute tasks without additional fine-tuning. We achieve 73.3% overall success across 90 real-robot trials at three difficulty levels, requiring no task-specific training.
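The grounding step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Box` and `Goal` types, the `resolve_target` and `goal_satisfied` helpers, the "in" predicate, and the rule that "top shelf" means the shelf detection highest in the image are all assumptions made for illustration. In the actual system a pretrained VLM would produce the symbolic goal and a detector would produce the boxes; here both are written by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in image coordinates (y grows downward)."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def center(self):
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)

    def contains(self, point):
        x, y = point
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


@dataclass(frozen=True)
class Goal:
    """Hypothetical symbolic goal, e.g. in(red mug, top shelf)."""
    predicate: str
    obj: str
    target: str


def resolve_target(name, detections):
    """Map a name or spatial phrase to one detected box.

    Assumption: "top shelf" is the shelf box with the smallest y_min,
    i.e. the one highest in the image. Other names match by label.
    """
    if name == "top shelf":
        shelves = [b for b in detections if b.label == "shelf"]
        if not shelves:
            raise LookupError("no shelf detected")
        return min(shelves, key=lambda b: b.y_min)
    matches = [b for b in detections if b.label == name]
    if not matches:
        raise LookupError(f"no detection labelled {name!r}")
    return matches[0]


def goal_satisfied(goal, detections):
    """Check an "in" goal: the object's box center lies inside the target box."""
    if goal.predicate != "in":
        raise ValueError(f"unsupported predicate {goal.predicate!r}")
    obj_box = resolve_target(goal.obj, detections)
    target_box = resolve_target(goal.target, detections)
    return target_box.contains(obj_box.center)


shelves = [Box("shelf", 0, 0, 100, 40), Box("shelf", 0, 60, 100, 100)]
goal = Goal("in", "red mug", "top shelf")
before = shelves + [Box("red mug", 40, 70, 60, 90)]  # mug on the lower shelf
after = shelves + [Box("red mug", 40, 10, 60, 30)]   # mug on the upper shelf
print(goal_satisfied(goal, before), goal_satisfied(goal, after))
```

Expressing the goal as a predicate over detected boxes, rather than as fixed coordinates, is what lets the same check work when the shelf or the object appears elsewhere in the image.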

