CV, MM | Sep 19, 2025

Zero-Shot Visual Grounding in 3D Gaussians via View Retrieval

arXiv:2509.15871v1 | 4 citations | h-index: 5 | Has Code
Originality: Highly original
AI Analysis

This work addresses the challenge of locating objects in 3D scenes for applications such as robotics, offering a zero-shot approach that avoids the per-scene training existing methods require.

The paper tackles the problem of 3D visual grounding in 3D Gaussian Splatting by proposing a zero-shot framework that avoids per-scene training and large labeled datasets, achieving state-of-the-art performance.

3D Visual Grounding (3DVG) aims to locate objects in 3D scenes based on text prompts, which is essential for applications such as robotics. However, existing 3DVG methods encounter two main challenges: first, they struggle to handle the implicit representation of spatial textures in 3D Gaussian Splatting (3DGS), making per-scene training indispensable; second, they typically require large amounts of labeled data for effective training. To this end, we propose Grounding via View Retrieval (GVR), a novel zero-shot visual grounding framework for 3DGS that recasts 3DVG as a 2D retrieval task. GVR leverages object-level view retrieval to collect grounding clues from multiple views, which not only avoids the costly process of 3D annotation but also eliminates the need for per-scene training. Extensive experiments demonstrate that our method achieves state-of-the-art visual grounding performance while avoiding per-scene training, providing a solid foundation for zero-shot 3DVG research. Video demos can be found at https://github.com/leviome/GVR_demos.
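
To make the idea of treating grounding as 2D retrieval more concrete, the following is a minimal sketch of text-to-view retrieval by cosine similarity. It is not the authors' implementation: the abstract does not specify how views are scored, so the function names (`cosine`, `retrieve_views`), the toy 3-dimensional embeddings, and the `top_k` parameter are all assumptions made for illustration. In practice the embeddings would presumably come from a pretrained vision-language encoder applied to rendered views and to the text prompt.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_views(query_embedding, view_embeddings, top_k=2):
    """Rank views by similarity to the text query and return the top_k.

    Hypothetical helper: view_embeddings maps a view id to an embedding
    of that rendered view; query_embedding is the text prompt embedding.
    """
    scored = [
        (cosine(query_embedding, emb), view_id)
        for view_id, emb in view_embeddings.items()
    ]
    # Highest similarity first; ties broken by view id for determinism.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [(view_id, score) for score, view_id in scored[:top_k]]


# Toy data (assumed): three rendered views and one text query.
views = {
    "view_00": [0.9, 0.1, 0.0],
    "view_01": [0.1, 0.9, 0.2],
    "view_02": [0.7, 0.3, 0.1],
}
query = [1.0, 0.0, 0.0]

top = retrieve_views(query, views, top_k=2)
for view_id, score in top:
    print(view_id, round(score, 3))
```

The views retrieved this way would then serve as the sources of 2D grounding clues that are combined across views; how that combination works is not described in the abstract and is left out of this sketch.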
