CV · RO · May 20

SceneGraphGrounder: Zero-Shot 3D Visual Grounding via Structured Scene Graph Matching

arXiv:2605.21788 · 34.4
AI Analysis

For researchers in 3D vision and robotics, this work provides an interpretable and spatially consistent zero-shot grounding method, though performance is only competitive (not SOTA) among zero-shot approaches.

SceneGraphGrounder reformulates zero-shot 3D visual grounding as structured graph matching over a reconstructed 3D scene graph, achieving competitive performance on ScanRefer among zero-shot approaches using only RGB-D inputs.

Zero-shot 3D visual grounding requires localizing objects in unstructured environments from free-form natural language. Recent vision-language model (VLM) approaches achieve promising results but rely on view-dependent reasoning or implicit representations, limiting spatial consistency and interpretability for compositional queries. We propose SceneGraphGrounder, a framework that reformulates 3D grounding as structured graph matching over a reconstructed 3D scene graph. To enable this formulation, we introduce a visual marker prompting strategy that enables a VLM to infer object-object relationships from 2D views, which are subsequently lifted into a persistent 3D scene graph encoding both spatial and semantic relations. Given a query, we construct a query graph and perform constrained alignment with the scene graph, ensuring multi-view consistency and interpretable reasoning. Experiments on the ScanRefer benchmark demonstrate that our method achieves competitive performance among zero-shot approaches, using only RGB-D inputs. We further validate our framework through real-world deployment on a mobile robot, demonstrating robust spatial reasoning in long-horizon physical environments. We will make our code publicly available upon acceptance.
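The abstract's core idea, aligning a query graph with a scene graph under label and relation constraints, can be illustrated with a toy example. The sketch below is not the authors' implementation: the object ids, category labels, relation names, and the `match_query` helper are all hypothetical, and the exhaustive search stands in for whatever constrained alignment procedure the paper actually uses.

```python
from itertools import permutations

# Toy scene graph (hypothetical data): nodes carry a category label,
# directed edges carry a relation name.
scene_nodes = {
    "obj0": "chair",
    "obj1": "chair",
    "obj2": "table",
    "obj3": "window",
}
scene_edges = {
    ("obj0", "obj2"): "next_to",
    ("obj1", "obj3"): "next_to",
    ("obj2", "obj3"): "under",
}

# Query graph for "the chair next to the table".
# "target" is the node we want to localize; "anchor" is the reference object.
query_nodes = {"target": "chair", "anchor": "table"}
query_edges = {("target", "anchor"): "next_to"}


def match_query(query_nodes, query_edges, scene_nodes, scene_edges):
    """Return every assignment of query nodes to distinct scene nodes that
    satisfies all label and relation constraints (exhaustive search)."""
    q_ids = list(query_nodes)
    matches = []
    for candidate in permutations(scene_nodes, len(q_ids)):
        assignment = dict(zip(q_ids, candidate))
        # Node constraint: category labels must agree.
        if any(scene_nodes[assignment[q]] != query_nodes[q] for q in q_ids):
            continue
        # Edge constraint: each query relation must hold between the mapped objects.
        if all(
            scene_edges.get((assignment[a], assignment[b])) == rel
            for (a, b), rel in query_edges.items()
        ):
            matches.append(assignment)
    return matches


matches = match_query(query_nodes, query_edges, scene_nodes, scene_edges)
print(matches)  # [{'target': 'obj0', 'anchor': 'obj2'}]
```

Both chairs satisfy the label constraint, but only `obj0` has a `next_to` edge to the table, so the relation constraint disambiguates the target. Because the match is an explicit node-to-node assignment, it is easy to inspect why a given object was selected. Brute-force enumeration is only practical for tiny graphs; a real system would need pruning or soft scoring to handle noisy relations.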

