CV · Mar 4

InfinityStory: Unlimited Video Generation with World Consistency and Character-Aware Shot Transitions

Mohamed Elmoghany, Liangbing Zhao, Xiaoqian Shen, Subhojyoti Mukherjee, Yang Zhou, Gang Wu, Viet Dac Lai, Seunghyun Yoon, Ryan Rossi, Abdullah Rashwan, Puneet Mathur, Varun Manjunatha

AI2

arXiv:2603.03646v12.81 citations · h-index: 32

Originality: Highly original

AI Analysis

This work addresses the problem of maintaining visual coherence and smooth shot transitions in long-form video generation for storytelling and content creation, extending beyond the single-subject limitations of prior work.

The paper tackles the challenge of generating long-form storytelling videos with consistent visual narratives by introducing a framework that addresses background consistency, seamless multi-subject shot transitions, and scalability to hour-long narratives. It achieves state-of-the-art results, including a Background Consistency score of 88.94 and Subject Consistency of 82.11 on VBench.

Generating long-form storytelling videos with consistent visual narratives remains a significant challenge in video synthesis. We present a novel framework, dataset, and model that address three critical limitations: background consistency across shots, seamless multi-subject shot-to-shot transitions, and scalability to hour-long narratives. Our approach introduces a background-consistent generation pipeline that maintains visual coherence across scenes while preserving character identity and spatial relationships. We further propose a transition-aware video synthesis module that generates smooth shot transitions for complex scenarios involving multiple subjects entering or exiting frames, going beyond the single-subject limitations of prior work. To support this, we contribute a synthetic dataset of 10,000 multi-subject transition sequences covering underrepresented dynamic scene compositions. On VBench, InfinityStory achieves the highest Background Consistency (88.94), the highest Subject Consistency (82.11), and the best overall average rank (2.80), showing improved stability, smoother transitions, and better temporal coherence.
