CV | AI | Aug 29, 2024

DriveGenVLM: Real-world Video Generation for Vision Language Model based Autonomous Driving

arXiv:2408.16647v1 | 14 citations | h-index: 12
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for sophisticated scenario prediction in autonomous driving by combining video generation and VLMs, though it appears incremental as it applies existing methods to a specific domain.

The paper tackles the problem of generating realistic driving videos for autonomous driving. It proposes DriveGenVLM, a framework that trains a denoising diffusion probabilistic model (DDPM) on the Waymo open dataset, evaluates the generated videos with the Fréchet Video Distance (FVD) score, and passes them to a pre-trained vision language model, EILEV, to produce narrations for enhanced scene understanding.

The advancement of autonomous driving technologies necessitates increasingly sophisticated methods for understanding and predicting real-world scenarios. Vision language models (VLMs) are emerging as revolutionary tools with significant potential to influence autonomous driving. In this paper, we propose the DriveGenVLM framework to generate driving videos and use VLMs to understand them. To achieve this, we employ a video generation framework grounded in denoising diffusion probabilistic models (DDPM) aimed at predicting real-world video sequences. We then explore the adequacy of our generated videos for use in VLMs by employing a pre-trained model known as Efficient In-context Learning on Egocentric Videos (EILEV). The diffusion model is trained with the Waymo open dataset and evaluated using the Fréchet Video Distance (FVD) score to ensure the quality and realism of the generated videos. Corresponding narrations are provided by EILEV for these generated videos, which may be beneficial in the autonomous driving domain. These narrations can enhance traffic scene understanding, aid in navigation, and improve planning capabilities. The integration of video generation with VLMs in the DriveGenVLM framework represents a significant step forward in leveraging advanced AI models to address complex challenges in autonomous driving.
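The abstract names the Fréchet Video Distance as the evaluation metric but does not spell it out. As a rough illustration only, the sketch below computes a Fréchet distance between two sets of feature vectors, each summarized as a Gaussian. It rests on assumptions that are not from the paper: the function names are hypothetical, the feature vectors are toy numbers rather than embeddings from a pretrained video network, and the covariance is taken to be diagonal so the example stays in the standard library. The usual FVD computation uses full covariance matrices, for which the second term becomes Tr(Σ_r + Σ_g - 2(Σ_r Σ_g)^(1/2)).

```python
import math
from statistics import fmean, pvariance


def feature_stats(features):
    """Return per-dimension means and variances of a list of feature vectors."""
    columns = list(zip(*features))  # transpose: one tuple per feature dimension
    means = [fmean(col) for col in columns]
    variances = [pvariance(col) for col in columns]
    return means, variances


def frechet_distance_diag(real_features, generated_features):
    """Squared Fréchet distance between two Gaussians with DIAGONAL covariance.

    Assumption for illustration: dimensions are treated as independent, so the
    trace term reduces to a sum of squared differences of standard deviations.
    """
    mu_r, var_r = feature_stats(real_features)
    mu_g, var_g = feature_stats(generated_features)
    # ||mu_r - mu_g||^2
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    # sum_i (sigma_r,i - sigma_g,i)^2, the diagonal case of the trace term
    cov_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var_r, var_g))
    return mean_term + cov_term


# Toy feature vectors standing in for video embeddings (hypothetical values).
real = [[0.0, 0.0], [2.0, 2.0]]
generated = [[1.0, 0.0], [3.0, 2.0]]
score = frechet_distance_diag(real, generated)
```

A lower score means the generated feature distribution sits closer to the real one; two identical feature sets give a distance of zero.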

