CV · AI · Dec 3, 2025

Think Before You Drive: World Model-Inspired Multimodal Grounding for Autonomous Vehicles

arXiv:2512.03454v1 · 1 citation · h-index: 15
Originality: Incremental advance
AI Analysis

The paper addresses a critical challenge in autonomous vehicles by improving visual grounding for ambiguous instructions. Its contribution is incremental, since it builds on existing world model and multimodal methods.

The paper tackles the interpretation of ambiguous natural-language commands for object localization in autonomous driving. It proposes ThinkDeeper, a framework that uses a world model to reason about future spatial states, and reports state-of-the-art performance on benchmarks such as Talk2Car and DrivePilot, with strong robustness and efficiency.

Interpreting natural-language commands to localize target objects is critical for autonomous driving (AD). Existing visual grounding (VG) methods for autonomous vehicles (AVs) typically struggle with ambiguous, context-dependent instructions, as they lack reasoning over 3D spatial relations and anticipated scene evolution. Grounded in the principles of world models, we propose ThinkDeeper, a framework that reasons about future spatial states before making grounding decisions. At its core is a Spatial-Aware World Model (SA-WM) that learns to reason ahead by distilling the current scene into a command-aware latent state and rolling out a sequence of future latent states, providing forward-looking cues for disambiguation. Complementing this, a hypergraph-guided decoder hierarchically fuses these states with the multimodal input, capturing higher-order spatial dependencies for robust localization. In addition, we present DrivePilot, a multi-source VG dataset in AD, featuring semantic annotations generated by a Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT)-prompted LLM pipeline. In extensive evaluations on six benchmarks, ThinkDeeper ranks #1 on the Talk2Car leaderboard and surpasses state-of-the-art baselines on the DrivePilot, MoCAD, and RefCOCO/+/g benchmarks. Notably, it shows strong robustness and efficiency in challenging scenes (long-text, multi-agent, ambiguity) and retains superior performance even when trained on 50% of the data.
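
The "distill, then roll out" idea described for the SA-WM can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`encode`, `rollout`), the random weight matrices, the feature vectors, and the `tanh` transition are all hypothetical stand-ins chosen only to show the shape of the computation, namely one command-aware latent state followed by a sequence of predicted future latent states. The hypergraph-guided decoder is not modeled here.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Random weights standing in for learned parameters (assumption)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def encode(scene_feat, command_feat, w_enc):
    """Distill scene and command features into one command-aware latent state."""
    joint = scene_feat + command_feat  # list concatenation
    return [math.tanh(x) for x in matvec(w_enc, joint)]


def rollout(z0, w_dyn, steps):
    """Roll the latent state forward, returning [z0, z1, ..., z_steps]."""
    states = [z0]
    for _ in range(steps):
        states.append([math.tanh(x) for x in matvec(w_dyn, states[-1])])
    return states


rng = random.Random(0)
scene_feat = [0.2, -0.1, 0.4, 0.3]   # hypothetical visual features
command_feat = [1.0, 0.0, 0.5]       # hypothetical command embedding
latent_dim = 3

w_enc = make_matrix(latent_dim, len(scene_feat) + len(command_feat), rng)
w_dyn = make_matrix(latent_dim, latent_dim, rng)

z0 = encode(scene_feat, command_feat, w_enc)
states = rollout(z0, w_dyn, steps=4)
print(len(states), "latent states of dimension", len(states[0]))
```

In a full system, a downstream decoder would consume the rolled-out states together with the multimodal input to make the grounding decision; the sketch stops at producing the states.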

