UniGround: Universal 3D Visual Grounding via Training-Free Scene Parsing
This work is significant for embodied AI and robotics, as it improves the generalization and robustness of 3D visual grounding to unseen spatial relationships and out-of-distribution scenes, without requiring 3D supervision.
This paper addresses the challenge of 3D Visual Grounding (3DVG) in complex 3D environments, aiming to locate objects from natural language descriptions. The proposed UniGround method, which uses training-free visual and geometric reasoning, achieves 46.1% Acc@0.25 and 34.1% Acc@0.5 on ScanRefer, and 28.7% Acc@0.25 on EmbodiedScan, setting a new state-of-the-art for zero-shot methods on EmbodiedScan without 3D supervision.
Understanding and localizing objects in complex 3D environments from natural language descriptions, known as 3D Visual Grounding (3DVG), is a foundational challenge in embodied AI, with broad implications for robotics, augmented reality, and human-machine interaction. Large-scale pre-trained foundation models have driven significant progress on this front, enabling open-vocabulary 3DVG that allows systems to locate arbitrary objects in a given scene. However, the reliance of existing methods on pre-trained models confines 3D perception and reasoning to the knowledge boundaries those models inherit, resulting in limited generalization to unseen spatial relationships and poor robustness to out-of-distribution scenes. In this paper, we replace this constrained perception with training-free visual and geometric reasoning, thereby unlocking open-world 3DVG that enables the localization of any object in any scene beyond the training data. Specifically, the proposed UniGround operates in two stages: a Global Candidate Filtering stage that constructs scene candidates through training-free 3D topology and multi-view semantic encoding, and a Local Precision Grounding stage that leverages multi-scale visual prompting and structured reasoning to precisely identify the target object. Experiments show that UniGround achieves 46.1% Acc@0.25 and 34.1% Acc@0.5 on ScanRefer and 28.7% Acc@0.25 on EmbodiedScan, establishing a new state-of-the-art among zero-shot methods on EmbodiedScan without any 3D supervision. We further evaluate UniGround in real-world environments under uncontrolled reconstruction conditions and substantial domain shift, showing that training-free reasoning generalizes robustly beyond curated benchmarks.
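For readers unfamiliar with the Acc@0.25 and Acc@0.5 figures quoted above, the metric is commonly understood as the fraction of queries whose predicted 3D box overlaps the ground-truth box with an IoU at or above the threshold. The following is a minimal sketch of that computation, not the authors' evaluation code. It assumes axis-aligned boxes given as `(xmin, ymin, zmin, xmax, ymax, zmax)` and one prediction per query; the function names `aabb_iou_3d` and `acc_at_iou` are hypothetical, and benchmarks that use oriented boxes would need a different overlap routine.

```python
from typing import Sequence

# Assumed box layout (illustrative): (xmin, ymin, zmin, xmax, ymax, zmax)
Box = Sequence[float]


def aabb_iou_3d(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        # A negative extent means no overlap along this axis.
        inter *= max(hi - lo, 0.0)
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    union = vol_a + vol_b - inter
    return inter / union if union > 0 else 0.0


def acc_at_iou(preds: Sequence[Box], gts: Sequence[Box], threshold: float) -> float:
    """Fraction of predictions whose IoU with the paired ground truth meets the threshold.

    Counting IoU exactly equal to the threshold as a hit is an assumption of this sketch.
    """
    if len(preds) != len(gts) or not preds:
        raise ValueError("preds and gts must be non-empty and of equal length")
    hits = sum(aabb_iou_3d(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)


if __name__ == "__main__":
    gt = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
    predictions = [
        (0.0, 0.0, 0.0, 2.0, 2.0, 2.0),  # exact match
        (1.0, 0.0, 0.0, 3.0, 2.0, 2.0),  # shifted by half the box width
        (5.0, 5.0, 5.0, 6.0, 6.0, 6.0),  # disjoint
    ]
    ground_truths = [gt, gt, gt]
    print(acc_at_iou(predictions, ground_truths, 0.25))
    print(acc_at_iou(predictions, ground_truths, 0.5))
```

Because the 0.5 threshold is stricter than 0.25, Acc@0.5 can never exceed Acc@0.25 on the same predictions, which matches the ordering of the two ScanRefer numbers reported above.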