CV CL May 8

Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs

Peitao Han, Fei Cheng, Lis K. Pereira, Qianying Liu, Shigeru Kitazawa

arXiv:2605.07568

AI Analysis

For researchers building Video-LLMs, this work locates a bottleneck in temporal information flow at the projector and shows that addressing it improves temporal reasoning.

The paper diagnoses why Video-LLMs struggle with the Arrow-of-Time (AoT) task, which asks whether a video is playing forward or backward. It finds that video-centric encoders encode strong temporal signals, but the projector (especially Q-Former) creates a bottleneck. Using a temporal-aware encoder, a time-preserved MLP projector, and AoT supervision, the authors reach 98.1% accuracy on AoT_PPB, surpassing human performance, and improve broader temporal reasoning tasks by up to 6.0 points.

The Arrow-of-Time (AoT) task, determining whether a video plays forward or backward by recognizing temporal irreversibility, is one humans solve with near-perfect accuracy, yet frontier Video Large Language Models (Video-LLMs) perform only modestly above chance. This gap raises a key question: do visual backbones fail to encode temporal information, or does the information bottleneck lie elsewhere in the Video-LLM architecture? We address this question by isolating the vision encoder from the Video-LLM and tracing temporal information across the encoder, projector, and LLM. We find that video-centric encoders with explicit temporal modeling encode strong temporal signals, whereas frame-centric encoders do not. However, when video-centric representations are passed through a standard Video-LLM architecture, performance often collapses, revealing a bottleneck in temporal information flow. We identify projector design as a key factor: Q-Former disrupts temporal information, while a time-preserved MLP projection substantially improves the LLM's access to such information. Our layer-wise analysis further shows how temporal representations evolve across encoder layers. Guided by these findings, we build a Video-LLM with a temporal-aware video-centric encoder, a time-preserved projector, and AoT supervision, surpassing human performance on AoT$_{PPB}$ with 98.1% accuracy, and improving broader temporal reasoning tasks by up to 6.0 points on VITATECS-Direction and 1.3 points on TVBench. Our results show that temporal reasoning in Video-LLMs requires both effective temporal encoding and reliable transfer of this information to the LLM.
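
To make the bottleneck argument concrete, here is a minimal toy sketch in plain Python. It is not the authors' code: the frame features, the function names, and the fixed scaling used in place of a learned MLP are all hypothetical. The averaging function is only a deliberately simplified stand-in for any projector that discards frame order; it is not an implementation of Q-Former. The sketch builds a forward/backward AoT pair from one clip and checks which projection still lets a downstream model tell the two directions apart.

```python
from typing import List, Sequence, Tuple

Frame = List[float]  # one toy feature vector per frame (hypothetical values)


def make_aot_pair(clip: Sequence[Frame]) -> List[Tuple[List[Frame], str]]:
    """Return the clip in both playback directions with AoT labels."""
    forward = [list(frame) for frame in clip]
    backward = forward[::-1]  # same frames, reversed temporal order
    return [(forward, "forward"), (backward, "backward")]


def time_preserved_projection(clip: Sequence[Frame], scale: float = 2.0) -> List[Frame]:
    """Apply the same per-frame map to every frame and keep frame order.

    Assumption: a fixed scaling stands in for a learned per-frame MLP.
    """
    return [[scale * x for x in frame] for frame in clip]


def order_free_pooling(clip: Sequence[Frame]) -> Frame:
    """Average over the time axis; the result ignores frame order."""
    n = len(clip)
    dim = len(clip[0])
    return [sum(frame[d] for frame in clip) / n for d in range(dim)]


clip = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
(forward, _), (backward, _) = make_aot_pair(clip)

# Pooling over time maps both directions to the same vector.
same_after_pooling = order_free_pooling(forward) == order_free_pooling(backward)
# The per-frame projection keeps the two directions distinguishable.
same_after_projection = (
    time_preserved_projection(forward) == time_preserved_projection(backward)
)
print(same_after_pooling, same_after_projection)  # prints: True False
```

In this toy setting, the pooled representation is identical for both playback directions, so no downstream model could recover the arrow of time from it, whereas the order-preserving projection keeps the two inputs distinct. This only illustrates why a projector can act as a bottleneck; the paper's actual analysis traces information through trained encoders, projectors, and the LLM.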
