CV · Dec 9, 2024

World knowledge-enhanced Reasoning Using Instruction-guided Interactor in Autonomous Driving

Mingliang Zhai, Cheng Li, Zengyuan Guo, Ningrui Yang, Xiameng Qin, Sanyuan Zhao, Junyu Han, Ji Tao, Yuwei Wu, Yunde Jia

arXiv:2412.06324v39.610 citations · h-index: 18 · AAAI

Originality: Incremental advance

AI Analysis

This work addresses safety risks for vulnerable road users in autonomous driving by improving reasoning under occlusion, though it appears incremental because it builds on existing MLLM frameworks.

The paper tackles the difficulty autonomous driving systems have in integrating perception with world knowledge in perception-limited areas. It proposes an instruction-guided interaction module and collects a large-scale multi-modal dataset; the authors report that extensive experiments validate the method's effectiveness.

Multi-modal Large Language Models (MLLMs) with extensive world knowledge have revitalized autonomous driving, particularly in reasoning tasks within perceivable regions. However, when faced with perception-limited areas (dynamic or static occlusion regions), MLLMs struggle to effectively integrate perception ability with world knowledge for reasoning. These perception-limited regions can conceal crucial safety information, especially for vulnerable road users. In this paper, we propose a framework that aims to improve autonomous driving performance under perception-limited conditions by enhancing the integration of perception capabilities and world knowledge. Specifically, we propose a plug-and-play instruction-guided interaction module that bridges modality gaps and significantly reduces the input sequence length, allowing it to adapt effectively to multi-view video inputs. Furthermore, to better integrate world knowledge with driving-related tasks, we have collected and refined a large-scale multi-modal dataset that includes 2 million natural language QA pairs and 1.7 million grounding task samples. To evaluate the model's utilization of world knowledge, we introduce an object-level risk assessment dataset comprising 200K QA pairs, in which the questions require multi-step reasoning that leverages world knowledge. Extensive experiments validate the effectiveness of our proposed method.
