CVDec 15, 2025

What Happens Next? Next Scene Prediction with a Unified Video Model

arXiv:2512.13015v1
Originality Highly original
AI Analysis

This addresses the underexplored temporal reasoning potential of unified video models for advancing generalist multimodal systems.

The paper tackles the problem of predicting plausible future scenes from preceding video context, introducing the Next Scene Prediction (NSP) task to push unified video models toward temporal and causal reasoning. The proposed model achieves state-of-the-art performance on their new benchmark.

Recent unified models for joint understanding and generation have significantly advanced visual generation capabilities. However, their focus on conventional tasks like text-to-video generation has left the temporal reasoning potential of unified models largely underexplored. To address this gap, we introduce Next Scene Prediction (NSP), a new task that pushes unified video models toward temporal and causal reasoning. Unlike text-to-video generation, NSP requires predicting plausible futures from preceding context, demanding deeper understanding and reasoning. To tackle this task, we propose a unified framework combining Qwen-VL for comprehension and LTX for synthesis, bridged by a latent query embedding and a connector module. This model is trained in three stages on our newly curated, large-scale NSP dataset: text-to-video pre-training, supervised fine-tuning, and reinforcement learning (via GRPO) with our proposed causal consistency reward. Experiments demonstrate our model achieves state-of-the-art performance on our benchmark, advancing the capability of generalist multimodal systems to anticipate what happens next.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes