CL | CV | Jun 12, 2025

Burn After Reading: Do Multimodal Large Language Models Truly Capture Order of Events in Image Sequences?

arXiv:2506.10415v1 | 5 citations | h-index: 2 | Has Code | ACL
Originality: Synthesis-oriented
AI Analysis

This work addresses a critical limitation of MLLMs in applications that require understanding event sequences, though its contribution is incremental: it benchmarks the problem rather than solving it.

The paper assesses temporal grounding and reasoning in Multimodal Large Language Models (MLLMs) using the TempVS benchmark, showing that 38 state-of-the-art models struggle on it and fall substantially short of human performance.

This paper introduces the TempVS benchmark, which focuses on the temporal grounding and reasoning capabilities of Multimodal Large Language Models (MLLMs) in image sequences. TempVS consists of three main tests (i.e., event relation inference, sentence ordering, and image ordering), each accompanied by a basic grounding test. TempVS requires MLLMs to rely on both visual and linguistic modalities to understand the temporal order of events. We evaluate 38 state-of-the-art MLLMs, demonstrating that models struggle to solve TempVS, with a substantial performance gap compared to human capabilities. We also provide fine-grained insights that suggest promising directions for future research. Our TempVS benchmark data and code are available at https://github.com/yjsong22/TempVS.
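To make the ordering tests concrete, here is a minimal sketch of how predictions on a sentence-ordering or image-ordering item could be scored with exact-match accuracy. This is an illustration only: the function name `exact_match_accuracy`, the toy data, and the choice of exact match as the metric are assumptions, not the evaluation protocol used by TempVS (see the linked repository for the actual code).

```python
from typing import Sequence


def exact_match_accuracy(
    predictions: Sequence[Sequence[int]],
    references: Sequence[Sequence[int]],
) -> float:
    """Fraction of items whose predicted order equals the reference order.

    Hypothetical helper: each item is a permutation of indices, e.g.
    [2, 0, 1] means "third element first, then first, then second".
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    # An item counts as correct only if the whole sequence matches.
    correct = sum(list(p) == list(r) for p, r in zip(predictions, references))
    return correct / len(references)


# Toy data (made up for illustration, not taken from the benchmark).
references = [[0, 1, 2], [2, 0, 1], [1, 2, 0], [0, 2, 1]]
predictions = [[0, 1, 2], [2, 1, 0], [1, 2, 0], [0, 2, 1]]

accuracy = exact_match_accuracy(predictions, references)
print(f"exact-match accuracy: {accuracy:.2f}")
```

Exact match is a strict choice: a prediction with a single swapped pair scores the same as a fully reversed one. A partial-credit metric would distinguish those cases, but which metric the benchmark reports is not stated in the abstract.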

Code Implementations: 1 repo
