CV · Jul 15, 2025

ViewSRD: 3D Visual Grounding via Structured Multi-View Decomposition

Ronggang Huang, Haoxin Yang, Yan Cai, Xuemiao Xu, Huaidong Zhang, Shengfeng He

arXiv:2507.11261v2 · 7 citations · h-index: 24 · Has Code

Originality: Highly original

AI Analysis

This work addresses accurate localization of objects in 3D scenes from textual descriptions, with applications such as robotics and augmented reality. It represents a strong, task-specific gain rather than a foundational advance.

The paper tackles two challenges in 3D visual grounding: disentangling targets from anchors in complex multi-anchor queries, and resolving spatial inconsistencies caused by perspective variations. The resulting framework is reported to significantly outperform state-of-the-art methods, especially on complex queries that require precise spatial differentiation.

3D visual grounding aims to identify and localize objects in a 3D space based on textual descriptions. However, existing methods struggle with disentangling targets from anchors in complex multi-anchor queries and resolving inconsistencies in spatial descriptions caused by perspective variations. To tackle these challenges, we propose ViewSRD, a framework that formulates 3D visual grounding as a structured multi-view decomposition process. First, the Simple Relation Decoupling (SRD) module restructures complex multi-anchor queries into a set of targeted single-anchor statements, generating a structured set of perspective-aware descriptions that clarify positional relationships. These decomposed representations serve as the foundation for the Multi-view Textual-Scene Interaction (Multi-TSI) module, which integrates textual and scene features across multiple viewpoints using shared, Cross-modal Consistent View Tokens (CCVTs) to preserve spatial correlations. Finally, a Textual-Scene Reasoning module synthesizes multi-view predictions into a unified and robust 3D visual grounding. Experiments on 3D visual grounding datasets show that ViewSRD significantly outperforms state-of-the-art methods, particularly in complex queries requiring precise spatial differentiation. Code is available at https://github.com/visualjason/ViewSRD.
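The data flow described in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation: the function names (`decouple`, `is_left_of`, `fuse_views`), the 2D geometry, and the score-averaging rule are all assumptions made for illustration, and the learned components (SRD, Multi-TSI, CCVTs) are replaced by hand-written stand-ins. The sketch shows three ideas only: a multi-anchor query split into single-anchor statements, a "left of" relation that flips when the viewpoint rotates by 180 degrees, and per-view candidate scores merged into one prediction.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SingleAnchorStatement:
    """One target-relation-anchor triple (hypothetical structure)."""
    target: str
    relation: str
    anchor: str


def decouple(target, relations):
    """Split a multi-anchor query into single-anchor statements.

    Toy stand-in for relation decoupling: the relations are assumed to be
    already parsed into (relation, anchor) pairs.
    """
    return [SingleAnchorStatement(target, rel, anchor) for rel, anchor in relations]


def is_left_of(target_xy, anchor_xy, view_angle_deg):
    """Return True if the target appears left of the anchor for a viewer
    facing direction `view_angle_deg` in the ground plane (0 deg = +x)."""
    theta = math.radians(view_angle_deg)
    # The viewer's right-hand direction is the forward vector rotated by -90 deg.
    right = (math.sin(theta), -math.cos(theta))
    dx = target_xy[0] - anchor_xy[0]
    dy = target_xy[1] - anchor_xy[1]
    return dx * right[0] + dy * right[1] < 0


def fuse_views(per_view_scores):
    """Average each candidate's score over all views and return the best one.

    Simple averaging is an assumption; the paper uses a learned reasoning module.
    """
    totals = {}
    for scores in per_view_scores:
        for obj, score in scores.items():
            totals[obj] = totals.get(obj, 0.0) + score
    n_views = len(per_view_scores)
    return max(totals, key=lambda obj: totals[obj] / n_views)


# A query such as "the chair next to the table and left of the window".
statements = decouple("chair", [("next to", "table"), ("left of", "window")])

# The same pair of objects seen from two opposite viewpoints.
chair, window = (-1.0, 0.0), (1.0, 0.0)
left_from_front = is_left_of(chair, window, 90.0)
left_from_back = is_left_of(chair, window, 270.0)

# Made-up per-view scores for two candidate chairs.
best = fuse_views([
    {"chair_1": 0.6, "chair_2": 0.4},
    {"chair_1": 0.3, "chair_2": 0.7},
    {"chair_1": 0.2, "chair_2": 0.8},
])
```

In this toy setup the chair is left of the window from one viewpoint and right of it from the opposite one, which is the kind of perspective-dependent inconsistency that motivates generating perspective-aware descriptions and reasoning over several views.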
