CV · May 28

YoCausal: How Far is Video Generation from World Model? A Causality Perspective

arXiv:2605.30346 · 72.1
Predicted impact: top 41% in CV · last 90 days
Originality: Highly original
AI Analysis

For researchers developing video generation models as world models, this benchmark exposes the critical limitation that models overfit to temporal statistics rather than learning causality.

The paper introduces YoCausal, a two-level benchmark for evaluating causal understanding in video diffusion models, revealing that current models perceive temporal order but fail at genuine causal reasoning, with a significant gap compared to human cognition.

As video diffusion models (VDMs) advance toward world models, a key question arises: do they truly understand causality, or merely overfit to statistical temporal patterns? Existing benchmarks mostly rely on synthetic data, limiting real-world generalization due to the sim-to-real gap. We present YoCausal, a two-level benchmark inspired by the Violation of Expectation (VoE) paradigm from cognitive science. By temporally reversing real-world videos at zero cost as natural counterfactual samples, YoCausal establishes an arbitrarily extensible evaluation protocol. Level 1 introduces the Reverse Surprise Index (RSI), quantifying arrow-of-time perception via denoising loss. Level 2 introduces the Causality Cognition Index (CCI), which leverages a VLM to stratify datasets into causal and non-causal subsets, disentangling genuine causal reasoning from temporal bias. Evaluation of 13 state-of-the-art VDMs reveals that perceiving the arrow of time does not imply understanding causality, and a significant gap persists relative to human-level causal cognition.
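The reversal protocol described in the abstract can be illustrated with a small sketch. This is not the paper's definition of RSI or CCI, which the abstract does not spell out. It assumes a per-clip score equal to the loss on the reversed clip minus the loss on the original, and it assumes the causal/non-causal labels (produced by a VLM in the paper) are already given as booleans. The names `surprise_gap`, `stratified_gaps`, and `toy_loss` are hypothetical, and `toy_loss` is a toy stand-in for a real model's denoising loss.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple

# Toy stand-in for a video: one scalar "frame" per time step.
Video = Sequence[float]


def reverse_video(video: Video) -> List[float]:
    """Temporally reverse a clip to obtain the counterfactual sample."""
    return list(reversed(video))


def surprise_gap(loss_fn: Callable[[Video], float], video: Video) -> float:
    """Assumed per-clip score: loss on the reversed clip minus loss on the original.

    A positive value means the model finds the reversed clip more surprising.
    """
    return loss_fn(reverse_video(video)) - loss_fn(video)


def stratified_gaps(
    loss_fn: Callable[[Video], float],
    videos: Sequence[Video],
    is_causal: Sequence[bool],
) -> Tuple[float, float]:
    """Mean gap on the causal subset and on the non-causal subset.

    The labels are assumed to be given; in the paper they come from a VLM.
    """
    causal = [surprise_gap(loss_fn, v) for v, c in zip(videos, is_causal) if c]
    non_causal = [surprise_gap(loss_fn, v) for v, c in zip(videos, is_causal) if not c]
    return mean(causal), mean(non_causal)


def toy_loss(video: Video) -> float:
    """Hypothetical loss: penalises every decrease between consecutive frames."""
    return sum(max(0.0, a - b) for a, b in zip(video, video[1:]))


videos = [[0.0, 1.0, 2.0, 3.0], [1.0, 1.0, 1.0, 1.0]]
labels = [True, False]  # first clip has a direction, second is time-symmetric
causal_gap, non_causal_gap = stratified_gaps(toy_loss, videos, labels)
print(causal_gap, non_causal_gap)  # 3.0 0.0
```

In this toy setup the time-symmetric clip yields a zero gap, while the directional clip yields a positive one. Comparing the two subset means is one simple way to separate sensitivity to direction from sensitivity that appears only on causally labelled clips.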

