CV AI RO Nov 8, 2024

Integrating Object Detection Modality into Visual Language Model for Enhanced Autonomous Driving Agent

arXiv:2411.05898v1 3 citations h-index: 5
Originality: Incremental advance
AI Analysis

This work addresses the need for more capable and interpretable autonomous driving systems, though it appears incremental as it builds on existing architectures like Llama-Adapter and CLIP.

The paper tackles the problem of visual comprehension in autonomous driving by integrating an object detection module into a visual language model, resulting in significant improvements over baseline models on the DriveLM visual question answering challenge as measured by ChatGPT, BLEU, and CIDEr scores.

In this paper, we propose a novel framework for enhancing visual comprehension in autonomous driving systems by integrating visual language models (VLMs) with an additional visual perception module specialised in object detection. We extend the Llama-Adapter architecture by incorporating a YOLOS-based detection network alongside the CLIP perception network, addressing limitations in object detection and localisation. Our approach introduces camera ID-separators to improve multi-view processing, which is crucial for comprehensive environmental awareness. Experiments on the DriveLM visual question answering challenge demonstrate significant improvements over baseline models, with enhanced performance in ChatGPT scores, BLEU scores, and CIDEr metrics, which indicate how close the model's answers are to the ground truth. Our method represents a promising step towards more capable and interpretable autonomous driving systems. Possible safety enhancements enabled by the detection modality are also discussed.
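
The abstract does not spell out how the two perception streams and the camera ID-separators are combined, so the following is only a minimal sketch under stated assumptions: each camera's CLIP-style and detection-style features are already projected to a shared width, and each camera has its own separator vector placed in front of that camera's tokens. The function name `build_multiview_sequence`, the camera names, and the random arrays standing in for CLIP and YOLOS outputs are all hypothetical and not taken from the paper.

```python
import numpy as np


def build_multiview_sequence(clip_feats, det_feats, separators):
    """Concatenate per-camera tokens, each group prefixed by its separator.

    Assumed layout (not from the paper):
      clip_feats[cam] -> (n_clip, d) array, stand-in for CLIP features
      det_feats[cam]  -> (n_det, d) array, stand-in for detection features
      separators[cam] -> (d,) array, stand-in for a camera ID-separator
    """
    chunks = []
    for cam_id in sorted(clip_feats):  # fixed camera order
        chunks.append(separators[cam_id][None, :])  # (1, d) separator token
        chunks.append(clip_feats[cam_id])           # global appearance tokens
        chunks.append(det_feats[cam_id])            # object-level tokens
    return np.concatenate(chunks, axis=0)


# Toy inputs: random arrays in place of real encoder outputs.
rng = np.random.default_rng(0)
d = 8
cams = ["CAM_FRONT", "CAM_BACK"]  # illustrative camera names
clip_feats = {c: rng.normal(size=(4, d)) for c in cams}
det_feats = {c: rng.normal(size=(3, d)) for c in cams}
separators = {c: rng.normal(size=d) for c in cams}

seq = build_multiview_sequence(clip_feats, det_feats, separators)
print(seq.shape)  # (16, 8): 2 cameras x (1 separator + 4 + 3 tokens)
```

In a sketch like this, the separator marks where one camera's tokens end and the next begin, so a downstream language model could in principle attribute each detection to the view it came from. Whether the paper uses learned embeddings, text tokens, or another mechanism for the separators is not stated in the abstract.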

