CV | AI | CL | Jul 1, 2024

ScanReason: Empowering 3D Visual Grounding with Reasoning Capabilities

arXiv:2407.01525v30.2261 citations | h-index: 11
AI Analysis 70

This work addresses a limitation of existing 3D visual grounding models: they cannot handle implicit instructions that require reasoning, a capability that matters for applications such as robotics and augmented reality.

The authors address the lack of reasoning capabilities in 3D visual grounding models by introducing a new task, 3D reasoning grounding, and a benchmark, ScanReason, with over 10K question-answer-location pairs. Their approach, ReGround3D, achieved strong performance on this benchmark.

Although great progress has been made in 3D visual grounding, current models still rely on explicit textual descriptions for grounding and lack the ability to reason about human intentions from implicit instructions. We propose a new task called 3D reasoning grounding and introduce a new benchmark, ScanReason, which provides over 10K question-answer-location pairs from five reasoning types that require the synergy of reasoning and grounding. We further design our approach, ReGround3D, composed of a visual-centric reasoning module empowered by a Multi-modal Large Language Model (MLLM) and a 3D grounding module that obtains accurate object locations by looking back to the enhanced geometry and fine-grained details of the 3D scenes. A chain-of-grounding mechanism is proposed to further boost performance with interleaved reasoning and grounding steps during inference. Extensive experiments on the proposed benchmark validate the effectiveness of our approach.
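The interleaving of reasoning and grounding steps described in the abstract can be pictured with a small control-flow sketch. This is only an illustration under stated assumptions, not the ReGround3D implementation: `Box3D`, `chain_of_grounding`, `toy_reason`, and `toy_ground` are hypothetical names, the "reasoner" is a hard-coded two-step plan standing in for an MLLM, and the "grounder" is a plain label match standing in for a learned 3D grounding module.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Box3D:
    """Hypothetical axis-aligned 3D box with a semantic label."""
    label: str
    center: Tuple[float, float, float]
    size: Tuple[float, float, float]


# A reasoner sees the question plus everything grounded so far and returns
# the next phrase to ground, or None when it has nothing more to ask for.
Reasoner = Callable[[str, List[Box3D]], Optional[str]]
# A grounder maps a phrase to the matching boxes in the scene.
Grounder = Callable[[str, List[Box3D]], List[Box3D]]


def chain_of_grounding(question: str, scene: List[Box3D],
                       reason: Reasoner, ground: Grounder,
                       max_steps: int = 4) -> List[Box3D]:
    """Alternate reasoning and grounding until the reasoner stops."""
    grounded: List[Box3D] = []
    for _ in range(max_steps):
        phrase = reason(question, grounded)      # reasoning step
        if phrase is None:
            break
        grounded.extend(ground(phrase, scene))   # grounding step
    return grounded


def toy_reason(question: str, grounded: List[Box3D]) -> Optional[str]:
    # Assumption: a fixed plan replaces the MLLM. Locate an anchor object
    # first, then the object that actually answers the question.
    plan = ["table", "cup"]
    seen = {box.label for box in grounded}
    for phrase in plan:
        if phrase not in seen:
            return phrase
    return None


def toy_ground(phrase: str, scene: List[Box3D]) -> List[Box3D]:
    # Assumption: exact label match replaces a learned grounding module.
    return [box for box in scene if box.label == phrase]


scene = [
    Box3D("sofa", (0.0, 2.0, 0.4), (2.0, 0.9, 0.8)),
    Box3D("table", (1.5, 0.0, 0.4), (1.2, 0.8, 0.8)),
    Box3D("cup", (1.6, 0.1, 0.85), (0.1, 0.1, 0.1)),
]
result = chain_of_grounding("I am thirsty. What could I use?", scene,
                            toy_reason, toy_ground)
print([box.label for box in result])
```

In this toy version each grounding result is fed back to the reasoner before the next step, which is the only property of the mechanism the sketch tries to capture. The `max_steps` cap is an added safeguard so the loop ends even when the grounder finds nothing.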

