CV · AI · ET · LG · RO · Jul 5, 2025

Pedestrian Intention Prediction via Vision-Language Foundation Models

arXiv:2507.04141v1 · 3 citations · h-index: 6 · 2025 IEEE Intelligent Vehicles Symposium (IV)
Originality: Synthesis-oriented
AI Analysis

This work addresses the problems of generalizability and context understanding in autonomous driving. Its contribution is incremental, since it applies existing foundation models to a specific domain.

The study tackles pedestrian crossing intention prediction for autonomous vehicles using vision-language foundation models with hierarchical prompt templates. It reports accuracy improvements of up to 19.8% from adding vehicle speed, its variation over time, and time-conscious prompts, plus a further 12.5% gain from automatically optimized prompts.

Prediction of pedestrian crossing intention is a critical function in autonomous vehicles. Conventional vision-based methods of crossing intention prediction often struggle with generalizability, context understanding, and causal reasoning. This study explores the potential of vision-language foundation models (VLFMs) for predicting pedestrian crossing intentions by integrating multimodal data through hierarchical prompt templates. The methodology incorporates contextual information, including visual frames, observations of physical cues, and ego-vehicle dynamics, into systematically refined prompts to guide VLFMs effectively in intention prediction. Experiments were conducted on three common datasets: JAAD, PIE, and FU-PIP. Results demonstrate that incorporating vehicle speed, its variations over time, and time-conscious prompts significantly enhances prediction accuracy, by up to 19.8%. Additionally, optimised prompts generated via an automatic prompt engineering framework yielded a further 12.5% accuracy gain. These findings highlight the superior performance of VLFMs compared to conventional vision-based models, offering enhanced generalisation and contextual understanding for autonomous driving applications.
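
The abstract describes the hierarchical prompt templates only at a high level. The sketch below is one way such a hierarchy might be assembled: each level adds one kind of context (physical cues, current ego speed, then speed trend and frame timing). Everything here is an assumption made for illustration and not the paper's actual templates: the level ordering, the prompt wording, the `PedestrianObservation` fields, and the `describe_speed_trend` and `build_prompt` helpers are all hypothetical. The code only builds the text part of the prompt; passing it to a model together with the frames is left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PedestrianObservation:
    """Hypothetical container for one sample; field names are assumptions."""
    frame_paths: List[str]               # image frames, oldest first
    speeds_kmh: List[float]              # ego-vehicle speed at each frame
    frame_interval_s: float              # time between consecutive frames
    physical_cues: Optional[str] = None  # e.g. "looking toward the vehicle"


def describe_speed_trend(speeds_kmh: List[float], tolerance: float = 0.5) -> str:
    """Label the change in ego speed between the first and last frame."""
    if len(speeds_kmh) < 2:
        return "unknown"
    delta = speeds_kmh[-1] - speeds_kmh[0]
    if delta > tolerance:
        return "accelerating"
    if delta < -tolerance:
        return "decelerating"
    return "steady"


def build_prompt(obs: PedestrianObservation, level: int) -> str:
    """Assemble a text prompt; higher levels add more context.

    Level 0: task question only.
    Level 1: + observed physical cues.
    Level 2: + current ego-vehicle speed.
    Level 3: + frame timing and speed trend (a "time-conscious" variant).
    The level scheme and wording are illustrative assumptions.
    """
    parts = [
        "Will the highlighted pedestrian cross the road in front of the "
        "vehicle? Answer 'yes' or 'no'."
    ]
    if level >= 1 and obs.physical_cues:
        parts.append(f"Observed cues: {obs.physical_cues}.")
    if level >= 2 and obs.speeds_kmh:
        parts.append(f"Current ego-vehicle speed: {obs.speeds_kmh[-1]:.1f} km/h.")
    if level >= 3:
        n_frames = len(obs.frame_paths)
        duration = obs.frame_interval_s * max(n_frames - 1, 0)
        trend = describe_speed_trend(obs.speeds_kmh)
        parts.append(
            f"The {n_frames} frames span {duration:.1f} s, oldest first; "
            f"the vehicle is {trend}."
        )
    return "\n".join(parts)


if __name__ == "__main__":
    sample = PedestrianObservation(
        frame_paths=["f0.jpg", "f1.jpg", "f2.jpg"],  # placeholder file names
        speeds_kmh=[30.0, 25.0, 20.0],
        frame_interval_s=0.5,
        physical_cues="standing at the curb, looking toward the vehicle",
    )
    print(build_prompt(sample, level=3))
```

Keeping each level as an additive step makes it straightforward to compare prompts with and without a given kind of context, which is the style of comparison the abstract reports for vehicle speed and time-conscious prompts.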
