CV · Jun 3, 2025

Seeing the Arrow of Time in Large Multimodal Models

arXiv:2506.03340v2 · 17 citations · h-index: 9
AI Analysis

This paper addresses a fundamental limitation in video comprehension for LMMs and offers a novel training approach with strong performance gains, though it is incremental in that it targets one specific temporal bottleneck.

The paper tackles the difficulty large multimodal models (LMMs) have in perceiving temporal directionality (the Arrow of Time) in videos when answering language queries. It introduces ArrowRL, a reinforcement learning training strategy with a reverse reward. Experiments show substantial improvements, with peak accuracy gains of over 20% on the authors' new AoTBench and over 10% on standard video VQA benchmarks.

The Arrow of Time (AoT), time's irreversible flow shaping physical events, is fundamental to video comprehension, yet remains a significant challenge for modern large multimodal models (LMMs). Current LMMs struggle to perceive and utilize temporal directionality in video when responding to language queries, obstructing deeper temporal understanding. We tackle this deficiency by first providing a critical analysis of existing benchmarks and models. We then introduce ArrowRL, a reinforcement learning (RL)-based training strategy with an innovative reverse reward that instills AoT awareness by encouraging divergent video interpretations between forward and reversed visual frames. For rigorous evaluation, we additionally develop AoTBench, a new multi-faceted benchmark probing temporally challenging questions. Experiments show ArrowRL greatly advances temporal perception: it not only achieves substantial improvements on our challenging AoTBench but also demonstrably boosts performance on standard video question answering (VQA) benchmarks (with peak accuracy gains reaching over 20% and 10% respectively). This validates ArrowRL's effectiveness and highlights the critical need for dedicated AoT understanding in LMMs.
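The reverse-reward idea in the abstract can be illustrated with a toy sketch. The abstract does not specify how the reward is computed, so everything below is an assumption made for illustration: the function names `reverse_frames`, `token_jaccard`, and `reverse_reward` are hypothetical, and token-level Jaccard overlap is a stand-in similarity measure, not the paper's formulation. The sketch only captures the stated intuition that a model's answer for the forward video should differ from its answer for the reversed video.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def reverse_frames(frames: Sequence[Frame]) -> List[Frame]:
    """Return the frames in reversed temporal order (hypothetical helper)."""
    return list(reversed(frames))


def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase whitespace tokens.

    Assumption: a simple stand-in for whatever similarity the paper uses.
    """
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty responses are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def reverse_reward(forward_response: str, reversed_response: str) -> float:
    """Toy reward in [0, 1]: higher when the two responses diverge.

    Assumption: illustrative only; not the ArrowRL reward definition.
    """
    return 1.0 - token_jaccard(forward_response, reversed_response)
```

In this sketch, a model that gives the same answer for both directions receives no reward, while one that describes different events (for example "the door opens" versus "the door closes") receives a positive reward, which is the direction-sensitivity the abstract describes.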

