CV · Jan 28, 2025

Scenario Understanding of Traffic Scenes Through Large Visual Language Models

arXiv:2501.17131v2 · 11 citations · h-index: 6 · 2025 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW)
Originality: Incremental advance
AI Analysis

This work addresses the bottleneck of manual data annotation in autonomous driving, offering a scalable way to improve model generalization across diverse domains.

The study automates scene-based categorization of autonomous driving datasets by evaluating Large Visual Language Models (LVLMs) such as GPT-4 and LLaVA on urban traffic scenes, and it assesses their effectiveness using both quantitative metrics and qualitative analysis.

Deep learning models for autonomous driving, encompassing perception, planning, and control, depend on vast datasets to achieve high performance. However, their generalization often suffers from domain-specific data distributions, so effective scene-based categorization of samples is needed to improve reliability across diverse domains. Manual captioning, though valuable, is labor-intensive and time-consuming, creating a bottleneck in the data annotation process. Large Visual Language Models (LVLMs) present a compelling solution by automating image analysis and categorization through contextual queries, often without requiring retraining for new categories. In this study, we evaluate the capabilities of LVLMs, including GPT-4 and LLaVA, to understand and classify urban traffic scenes on both an in-house dataset and BDD100K. We propose a scalable captioning pipeline that integrates state-of-the-art models, enabling flexible deployment on new datasets. Our analysis, combining quantitative metrics with qualitative insights, demonstrates the effectiveness of LVLMs in understanding urban traffic scenarios and highlights their potential as an efficient tool for data-driven advancements in autonomous driving.
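The idea of categorizing scenes through contextual queries, without retraining for new categories, can be illustrated with a minimal sketch. This is not the authors' pipeline: the category schema, the prompt wording, and the names `build_prompt`, `parse_answer`, `categorize`, and `fake_model` are all hypothetical, and the model is a stub standing in for a real LVLM call.

```python
from typing import Callable, Dict, List

# Hypothetical category schema (an assumption, not the paper's taxonomy).
CATEGORIES: Dict[str, List[str]] = {
    "weather": ["clear", "rainy", "snowy", "foggy"],
    "time_of_day": ["day", "night", "dawn_dusk"],
    "road_type": ["city_street", "highway", "residential"],
}


def build_prompt(attribute: str, options: List[str]) -> str:
    """Turn one attribute and its allowed labels into a closed-set question."""
    question = attribute.replace("_", " ")
    return (
        "Look at the traffic scene and answer with one option only. "
        f"What is the {question}? Options: {', '.join(options)}."
    )


def parse_answer(raw: str, options: List[str]) -> str:
    """Map free-form model text onto one allowed option, else 'unknown'."""
    text = raw.strip().lower().replace(" ", "_")
    # Check longer options first so a short label cannot shadow a longer one.
    for option in sorted(options, key=len, reverse=True):
        if option in text:
            return option
    return "unknown"


def categorize(
    image_path: str,
    query_model: Callable[[str, str], str],
    categories: Dict[str, List[str]] = CATEGORIES,
) -> Dict[str, str]:
    """Ask one question per attribute and collect the parsed labels."""
    labels: Dict[str, str] = {}
    for attribute, options in categories.items():
        raw = query_model(image_path, build_prompt(attribute, options))
        labels[attribute] = parse_answer(raw, options)
    return labels


def fake_model(image_path: str, prompt: str) -> str:
    """Stub standing in for an LVLM; it returns canned text per question."""
    if "weather" in prompt:
        return "The weather is Rainy."
    if "time of day" in prompt:
        return "night"
    return "I cannot tell."


labels = categorize("frame_0001.jpg", fake_model)
print(labels)
```

Adding a new category in this sketch only means adding an entry to the dictionary passed to `categorize`, which mirrors the retraining-free property described above. A real deployment would replace `fake_model` with a call to the chosen LVLM and would likely need stricter answer parsing.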

