CV, Mar 28, 2025

NuGrounding: A Multi-View 3D Visual Grounding Framework in Autonomous Driving

Fuhao Li, Huan Jin, Bin Gao, Liaoyuan Fan, Lihui Jiang, Long Zeng

arXiv:2503.22436v215.510 citations, h-index: 4

Originality: Incremental advance

AI Analysis

The paper addresses the challenge of interpreting natural language to localize objects in complex driving environments. The advance is incremental, since it builds on existing 3D scene understanding methods.

The paper tackles multi-view 3D visual grounding in autonomous driving with two contributions: NuGrounding, a large-scale benchmark with hierarchical instructions, and a method that combines multi-modal LLMs with detection models. The method achieves 0.59 precision and 0.64 recall, improvements of 50.8% and 54.7% over the baselines.

Multi-view 3D visual grounding is critical for autonomous vehicles to interpret natural language and localize target objects in complex environments. However, existing datasets and methods suffer from coarse-grained language instructions and inadequate integration of 3D geometric reasoning with linguistic comprehension. To this end, we introduce NuGrounding, the first large-scale benchmark for multi-view 3D visual grounding in autonomous driving. To construct NuGrounding, we present a Hierarchy of Grounding (HoG) method that generates hierarchical multi-level instructions, ensuring comprehensive coverage of human instruction patterns. To tackle this challenging dataset, we propose a novel paradigm that combines the instruction comprehension abilities of multi-modal LLMs (MLLMs) with the precise localization abilities of specialist detection models. Our approach introduces two decoupled task tokens and a context query to aggregate 3D geometric information and semantic instructions, followed by a fusion decoder that refines spatial-semantic feature fusion for precise localization. Extensive experiments demonstrate that our method outperforms baselines adapted from representative 3D scene understanding methods by a significant margin, achieving 0.59 in precision and 0.64 in recall, with improvements of 50.8% and 54.7%, respectively.
