CV | Mar 7, 2024

Embodied Understanding of Driving Scenarios

Yunsong Zhou, Linyan Huang, Qingwen Bu, Jia Zeng, Tianyu Li, Hang Qiu, Hongzi Zhu, Minyi Guo, Yu Qiao, Hongyang Li

arXiv:2403.04593v128.778 citations | h-index: 11 | Has Code | ECCV

Originality: Incremental advance

AI Analysis

The paper addresses the need for embodied scene understanding in autonomous driving. The advance appears incremental, since it builds on existing Vision-Language Models and adds spatial and temporal enhancements.

The problem it tackles is that autonomous agents lack spatial awareness and long-horizon extrapolation in driving scenarios. To address this, it introduces the Embodied Language Model (ELM), which the authors report surpasses previous state-of-the-art approaches in all aspects on a reformulated benchmark.

Embodied scene understanding serves as the cornerstone for autonomous agents to perceive, interpret, and respond to open driving scenarios. Such understanding is typically founded upon Vision-Language Models (VLMs). Nevertheless, existing VLMs are restricted to the 2D domain, devoid of spatial awareness and long-horizon extrapolation proficiencies. We revisit the key aspects of autonomous driving and formulate appropriate rubrics. Hereby, we introduce the Embodied Language Model (ELM), a comprehensive framework tailored for agents' understanding of driving scenes with large spatial and temporal spans. ELM incorporates space-aware pre-training to endow the agent with robust spatial localization capabilities. Besides, the model employs time-aware token selection to accurately inquire about temporal cues. We instantiate ELM on the reformulated multi-faceted benchmark, and it surpasses previous state-of-the-art approaches in all aspects. All code, data, and models will be publicly shared.
