CV · Mar 11

DriveXQA: Cross-modal Visual Question Answering for Adverse Driving Scene Understanding

arXiv:2603.11380v1 · 24.4 · h-index: 10
Predicted impact: top 26% in CV · last 90 days
Originality: Incremental advance
AI Analysis

This work addresses the need for robust scene understanding in autonomous driving by introducing a dataset and method for cross-modal VQA, though it is incremental as it builds on existing MLLM frameworks.

The authors tackled the problem of understanding adverse driving scenes in autonomous vehicles by creating DriveXQA, a multimodal dataset with 102,505 QA pairs, and proposing MVX-LLM, a token-efficient architecture that improved performance under challenging conditions like foggy weather (GPTScore: 53.5 vs. 25.1 for the baseline).

Fusing sensors with complementary modalities is crucial for maintaining a stable and comprehensive understanding of abnormal driving scenes. However, Multimodal Large Language Models (MLLMs) remain underexplored for leveraging multi-sensor information to understand adverse driving scenarios in autonomous vehicles. To address this gap, we propose DriveXQA, a multimodal dataset for autonomous driving VQA. It covers four visual modalities, five sensor failure cases, and five weather conditions, and includes 102,505 QA pairs categorized into three types: global scene level, allocentric level, and ego-vehicle centric level. Since no existing MLLM framework adopts multiple complementary visual modalities as input, we design MVX-LLM, a token-efficient architecture with a Dual Cross-Attention (DCA) projector that fuses the modalities to alleviate information redundancy. Experiments demonstrate that our DCA achieves improved performance under challenging conditions such as fog (GPTScore: 53.5 vs. 25.1 for the baseline). The established dataset and source code will be made publicly available.
