CV · Mar 11

DriveXQA: Cross-modal Visual Question Answering for Adverse Driving Scene Understanding

Mingzhe Tao, Ruiping Liu, Junwei Zheng, Yufan Chen, Kedi Ying, M. Saquib Sarfraz, Kailun Yang, Jiaming Zhang, Rainer Stiefelhagen

arXiv:2603.11380v1 · 12.2 · Has Code

Predicted impact: top 26% in CV · last 90 days

Originality: Incremental advance

AI Analysis

This work addresses the need for robust scene understanding in autonomous driving by introducing a dataset and method for cross-modal VQA, though it is incremental as it builds on existing MLLM frameworks.

The authors tackled the problem of understanding adverse driving scenes in autonomous vehicles by creating DriveXQA, a multimodal dataset with 102,505 QA pairs, and proposing MVX-LLM, a token-efficient architecture that improved performance under challenging conditions like foggy weather (GPTScore: 53.5 vs. 25.1 for the baseline).

Fusing sensors with complementary modalities is crucial for maintaining a stable and comprehensive understanding of abnormal driving scenes. However, Multimodal Large Language Models (MLLMs) are underexplored for leveraging multi-sensor information to understand adverse driving scenarios in autonomous vehicles. To address this gap, we propose DriveXQA, a multimodal dataset for autonomous driving VQA. In addition to four visual modalities, five sensor failure cases, and five weather conditions, it includes $102,505$ QA pairs categorized into three types: global scene level, allocentric level, and ego-vehicle centric level. Since no existing MLLM framework adopts multiple complementary visual modalities as input, we design MVX-LLM, a token-efficient architecture with a Dual Cross-Attention (DCA) projector that fuses the modalities to alleviate information redundancy. Experiments demonstrate that our DCA achieves improved performance under challenging conditions such as fog (GPTScore: $53.5$ vs. $25.1$ for the baseline). The established dataset and source code will be made publicly available.
