CV, AI, LG, RO | Jan 12, 2025

Application of Vision-Language Model to Pedestrians Behavior and Scene Understanding in Autonomous Driving

arXiv:2501.06680v2 | 2 citations | h-index: 3 | 2025 4th International Conference on Robotics, Artificial Intelligence and Intelligent Control (RAIIC)
Originality: Incremental advance
AI Analysis

This work addresses the problem of complex scenario understanding for autonomous driving systems, representing an incremental advancement through knowledge distillation and ensemble techniques.

The paper tackles the challenge of applying vision-language models to pedestrian behavior prediction and scene understanding in autonomous driving. It proposes a knowledge distillation method that transfers knowledge from large foundation models to efficient vision networks, and it reports significant metric improvements in open-vocabulary perception and trajectory prediction tasks.

Vision-language models (VLMs) have become a promising approach to enhancing perception and decision-making in autonomous driving. A gap remains, however, in applying VLMs to understand complex scenarios involving interactions with pedestrians and in deploying them efficiently on vehicles. In this paper, we propose a knowledge distillation method that transfers knowledge from large-scale vision-language foundation models to efficient vision networks. We apply it to pedestrian behavior prediction and scene understanding tasks, with promising results in generating more diverse and comprehensive semantic attributes. We also use multiple pre-trained models and ensemble techniques to boost the model's performance. Finally, we examine the effectiveness of the model after knowledge distillation; the results show significant metric improvements in open-vocabulary perception and trajectory prediction tasks, which can potentially enhance the end-to-end performance of autonomous driving.
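The abstract does not state which distillation objective is used, so the following is only a minimal sketch of the generic soft-label formulation, not the authors' method. It assumes the teachers emit per-class logits, that an ensemble is formed by averaging their temperature-softened distributions, and that the student is trained against a KL-divergence loss scaled by the squared temperature. The function names `softmax`, `ensemble_teacher_probs`, and `distillation_loss` are hypothetical and chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_teacher_probs(teacher_logits_list, temperature=1.0):
    """Average the softened distributions of several teachers.

    Plain averaging is an assumption; the source only says that
    multiple pre-trained models and ensemble techniques are used.
    """
    all_probs = [softmax(logits, temperature) for logits in teacher_logits_list]
    count = len(all_probs)
    width = len(all_probs[0])
    return [sum(p[i] for p in all_probs) / count for i in range(width)]


def distillation_loss(student_logits, teacher_probs, temperature=1.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    student_probs = softmax(student_logits, temperature)
    kl = sum(
        t * math.log(t / s)
        for t, s in zip(teacher_probs, student_probs)
        if t > 0  # terms with zero teacher mass contribute nothing
    )
    return kl * temperature ** 2


# Two hypothetical teachers scoring three semantic attributes.
teachers = [[2.0, 1.0, 0.1], [1.5, 1.2, 0.3]]
target = ensemble_teacher_probs(teachers, temperature=2.0)
loss = distillation_loss([0.0, 0.0, 0.0], target, temperature=2.0)
```

In this sketch a higher temperature flattens the teacher distribution, so the student also receives signal about the less likely attributes. The loss is zero when the student reproduces the teacher distribution exactly and positive otherwise.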

