CV · CL · May 8

Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs

arXiv:2605.07568
AI Analysis

For researchers building Video-LLMs, this work locates a bottleneck in temporal information flow between the vision encoder and the LLM and shows how removing it improves temporal reasoning.

The paper diagnoses why Video-LLMs struggle with the Arrow-of-Time (AoT) task, which asks whether a video is playing forward or backward. It finds that video-centric encoders encode strong temporal signals, but the projector (especially Q-Former) creates a bottleneck before that signal reaches the LLM. By combining a temporal-aware encoder, a time-preserved MLP projector, and AoT supervision, the authors reach 98.1% accuracy on AoT_PPB, surpassing human performance, and improve broader temporal reasoning by up to 6.0 points on VITATECS-Direction and 1.3 points on TVBench.

The Arrow-of-Time (AoT) task, determining whether a video plays forward or backward by recognizing temporal irreversibility, is one humans solve with near-perfect accuracy, yet frontier Video Large Language Models (Video-LLMs) perform only modestly above chance. This gap raises a key question: do visual backbones fail to encode temporal information, or does the information bottleneck lie elsewhere in the Video-LLM architecture? We address this question by isolating the vision encoder from the Video-LLM and tracing temporal information across the encoder, projector, and LLM. We find that video-centric encoders with explicit temporal modeling encode strong temporal signals, whereas frame-centric encoders do not. However, when video-centric representations are passed through a standard Video-LLM architecture, performance often collapses, revealing a bottleneck in temporal information flow. We identify projector design as a key factor: Q-Former disrupts temporal information, while a time-preserved MLP projection substantially improves the LLM's access to such information. Our layer-wise analysis further shows how temporal representations evolve across encoder layers. Guided by these findings, we build a Video-LLM with a temporal-aware video-centric encoder, a time-preserved projector, and AoT supervision, surpassing human performance on AoT_PPB with 98.1% accuracy and improving broader temporal reasoning tasks by up to 6.0 points on VITATECS-Direction and 1.3 points on TVBench. Our results show that temporal reasoning in Video-LLMs requires both effective temporal encoding and reliable transfer of this information to the LLM.
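The contrast between a time-preserved projection and an order-agnostic one can be illustrated with a toy sketch. This is not the paper's implementation: the function names `mlp_per_frame` and `query_pool`, the weights, and the frame features are all hypothetical, and `query_pool` is a deliberately simplified stand-in (a single attention step with no positional information), not an actual Q-Former. It only shows one way in which playback direction can become invisible after projection.

```python
import math

def mlp_per_frame(frames, w1, w2):
    """Hypothetical time-preserved projector: the same two-layer MLP is
    applied to each frame feature, yielding one token per frame, in order."""
    tokens = []
    for x in frames:
        hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w1]  # ReLU
        tokens.append([sum(w * h for w, h in zip(row, hidden)) for row in w2])
    return tokens

def query_pool(frames, queries):
    """Simplified query-based pooling (assumption, not a real Q-Former):
    each query takes a softmax-weighted average over all frames and uses
    no positional information, so frame order cannot affect the result."""
    dim = len(frames[0])
    tokens = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, x)) for x in frames]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        tokens.append(
            [sum(w * x[d] for w, x in zip(weights, frames)) / total for d in range(dim)]
        )
    return tokens

# Toy clip: the first feature grows with time, the second is constant.
frames = [[float(t), 1.0] for t in range(4)]
reversed_frames = frames[::-1]  # the same clip played backward

# Hypothetical fixed weights, chosen only to keep the arithmetic easy to follow.
w1 = [[1.0, 0.0], [0.0, 1.0]]
w2 = [[1.0, 1.0]]
queries = [[1.0, 0.0], [0.0, 1.0]]

forward_mlp = mlp_per_frame(frames, w1, w2)
backward_mlp = mlp_per_frame(reversed_frames, w1, w2)
forward_pool = query_pool(frames, queries)
backward_pool = query_pool(reversed_frames, queries)

pool_matches = all(
    math.isclose(a, b, abs_tol=1e-9)
    for f_tok, b_tok in zip(forward_pool, backward_pool)
    for a, b in zip(f_tok, b_tok)
)
print("MLP tokens differ between directions:", forward_mlp != backward_mlp)
print("Pooled tokens match between directions:", pool_matches)
```

In this sketch the per-frame MLP emits a token sequence that reverses when the clip is reversed, so a downstream model can in principle tell the two directions apart, while the pooled tokens are the same for both directions up to floating-point rounding. Real projectors are more complex, so this should be read as intuition for why projector design can matter, not as an explanation of the paper's measurements.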
