CV · LG · May 21

Enhancing Multimodal Large Language Models for Safety-Critical Driving Video Analysis

arXiv:2605.22185 · 75.7 · Has Code
Predicted impact: top 35% in CV · last 90 days · Originality: Synthesis-oriented
AI Analysis

For autonomous driving safety, this work provides a method to enhance MLLMs for rare high-stakes events, though it is an incremental improvement over existing MLLM fine-tuning approaches.

The paper addresses the inability of Multimodal Large Language Models (MLLMs) to accurately perceive and reason about rare safety-critical driving events. By fusing video frames with telematics data and semantic insights from specialized computer vision models, and by fine-tuning QwenVL-2.5 with DoRA adapters, the authors report significant improvements in identifying and explaining such events with fewer than 50M trainable parameters.

Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in general visual understanding. However, their application to safety-critical driving scenarios remains limited by an inability to accurately perceive and reason about rare high-stakes dynamic events, such as collisions or near-collisions. To address this, we introduce a pipeline that enhances MLLM perception by fusing downsampled video frames with synchronized high-frequency telematics data (IMU and GPS) and semantic insights from specialized computer vision models. Our pipeline generates high-quality pseudo-labels, including descriptive captions and question-answer pairs, specifically designed to train MLLMs to identify and describe Safety-Critical Events (SCEs) in real-world driving footage. We show the effectiveness of our approach by fine-tuning the open-source QwenVL-2.5 model via DoRA adapters: our experiments demonstrate significant improvements in identifying and explaining safety-critical events, with fewer than 50M trainable parameters and a limited computational budget.
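
The abstract does not say how downsampled frames are paired with the higher-rate telematics stream. As a minimal sketch of one plausible alignment step, and not the authors' pipeline, the Python below summarises the IMU samples that fall in a time window centred on each kept frame. The function names, the 0.5 s window, and the single-axis acceleration signal are all assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right
from statistics import mean


def downsample_timestamps(duration_s, fps):
    """Timestamps (seconds) of the frames kept when sampling at `fps`."""
    n_frames = int(duration_s * fps)
    return [i / fps for i in range(n_frames)]


def fuse_frames_with_imu(frame_ts, imu_ts, imu_accel, window_s=0.5):
    """Attach an IMU summary to each frame timestamp.

    Hypothetical helper: `imu_ts` must be sorted ascending and have the
    same length as `imu_accel` (one acceleration axis, in m/s^2).
    """
    fused = []
    for t in frame_ts:
        # Indices of IMU samples inside [t - window/2, t + window/2].
        lo = bisect_left(imu_ts, t - window_s / 2)
        hi = bisect_right(imu_ts, t + window_s / 2)
        window = imu_accel[lo:hi]
        fused.append({
            "t": t,
            "n_samples": len(window),
            "mean_accel": mean(window) if window else None,
            "peak_abs_accel": max(abs(a) for a in window) if window else None,
        })
    return fused


if __name__ == "__main__":
    # Synthetic 10 Hz IMU trace with one hard-braking spike at t = 1.0 s.
    imu_ts = [i / 10 for i in range(20)]
    imu_accel = [0.0] * 20
    imu_accel[10] = -8.0
    for record in fuse_frames_with_imu(downsample_timestamps(2.0, 1), imu_ts, imu_accel):
        print(record)
```

Summarising a window rather than taking the single nearest sample keeps short spikes, such as hard braking, visible even when the frame rate is far below the sensor rate. That is the kind of signal a per-frame text annotation could then expose to the model.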

