CV | AI | Dec 4, 2025

From Segments to Scenes: Temporal Understanding in Autonomous Driving via Vision-Language Model

arXiv:2512.05277v21 citations | h-index: 10 | Has Code
Originality: Incremental advance
AI Analysis

This work targets temporal reasoning in ego-centric autonomous driving footage. It gives researchers and developers a new benchmark and accompanying methods; the advance is incremental because the methods build on existing vision-language models.

This paper tackles the challenge of temporal understanding in autonomous driving by introducing the Temporal Understanding in Autonomous Driving (TAD) benchmark, which includes nearly 6,000 question-answer pairs across 7 tasks, and proposes two training-free solutions, Scene-CoT and TCogMap, that improve average accuracy on TAD by up to 17.72%.

Temporal understanding in autonomous driving (AD) remains a significant challenge, even for recent state-of-the-art (SoTA) Vision-Language Models (VLMs). Prior work has introduced datasets and benchmarks aimed at improving temporal reasoning, but these have emphasized other video content, including sports, cooking, and movies. No existing benchmark focuses exclusively on the unique challenges of temporal understanding in ego-centric AD footage. To fill this gap, the Temporal Understanding in Autonomous Driving (TAD) benchmark is presented, which evaluates VLMs' ability to capture the dynamic relationships between actions in AD. TAD comprises nearly 6,000 question-answer (QA) pairs spanning 7 human-designed tasks. In addition, an evaluation is performed on 9 closed- and open-source generalist models as well as SoTA AD specialist models. When applied to TAD, current SoTA models demonstrate substandard accuracy, largely due to imperfect fine-grained motion understanding. To improve motion understanding and overall accuracy on TAD, two novel training-free solutions are proposed: Scene-CoT, which leverages Chain-of-Thought (CoT) reasoning, and TCogMap, which incorporates an ego-centric temporal cognitive map. The proposed approaches are integrated with existing VLMs and improve average accuracy on TAD by up to 17.72%. By introducing TAD, benchmarking multiple SoTA models, and proposing effective enhancements, this work aims to catalyze future research on temporal understanding in AD. The benchmark and evaluation code are available at Hugging Face (https://huggingface.co/datasets/vbdai/TAD) and GitHub (https://github.com/vbdi/tad_bench), respectively.
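
The abstract reports a single "average accuracy" over TAD's 7 tasks. As a rough illustration of how such a figure could be computed, the sketch below scores QA predictions per task and then takes the unweighted mean across tasks. This is an assumption, not the authors' evaluation code: the record layout `(task, predicted, gold)`, the task names, the case-insensitive exact-match rule, and the choice of an unweighted per-task mean are all hypothetical.

```python
from collections import defaultdict


def per_task_accuracy(records):
    """Return {task: accuracy} for an iterable of (task, predicted, gold) tuples.

    Assumption: an answer counts as correct on a case-insensitive exact match.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, predicted, gold in records:
        total[task] += 1
        if predicted.strip().lower() == gold.strip().lower():
            correct[task] += 1
    return {task: correct[task] / total[task] for task in total}


def average_accuracy(records):
    """Unweighted mean of per-task accuracies (assumed averaging scheme)."""
    scores = per_task_accuracy(records)
    if not scores:
        raise ValueError("no records to score")
    return sum(scores.values()) / len(scores)


# Hypothetical records; the task names are placeholders, not TAD's task list.
records = [
    ("task_a", "A", "A"),
    ("task_a", "B", "A"),
    ("task_b", "yes", "Yes"),
]

if __name__ == "__main__":
    print(per_task_accuracy(records))
    print(average_accuracy(records))
```

Averaging per task rather than per question keeps a task with many QA pairs from dominating the score; whether TAD's reported number uses this scheme should be checked against the released evaluation code.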
