When and Where do Events Switch in Multi-Event Video Generation?
This work addresses the challenge of temporal coherence in multi-event video generation for AI and creative applications, but its contribution is incremental, since it builds on existing models such as OpenSora and CogVideoX.
The paper tackles the problem of generating coherent multi-event videos from text prompts by investigating when and where event transitions occur during generation, and finds that early intervention in denoising steps and block-wise layers is crucial for controlling these shifts.
Work on text-to-video (T2V) generation has surged in response to challenging problems, especially the case where a long video must depict multiple sequential events with temporal coherence and controllable content. Existing methods that extend to multi-event generation do not examine the intrinsic factors behind event shifting. The paper aims to answer the central question: when and where do multi-event prompts control event transitions during T2V generation? This work introduces MEve, a self-curated prompt suite for evaluating multi-event T2V generation, and conducts a systematic study of two representative model families, OpenSora and CogVideoX. Extensive experiments demonstrate the importance of early intervention in denoising steps and block-wise model layers, revealing the essential factor for multi-event video generation and highlighting the possibilities for multi-event conditioning in future models.
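To make the "when" (denoising step) and "where" (model block) axes concrete, the following is a minimal sketch of a prompt schedule over a step-by-block grid. It is an illustration only, not the paper's procedure: `InterventionWindow`, `prompt_schedule`, `coverage`, and the `"base"`/`"alt"` labels are hypothetical names, and the step and block counts are arbitrary. It assumes that an intervention means swapping the conditioning prompt inside a rectangular window of steps and blocks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InterventionWindow:
    """Hypothetical window in which an alternative prompt replaces the base prompt.

    All bounds are inclusive and zero-indexed.
    """
    first_step: int
    last_step: int
    first_block: int
    last_block: int

    def covers(self, step: int, block: int) -> bool:
        """Return True if the (step, block) cell lies inside the window."""
        return (self.first_step <= step <= self.last_step
                and self.first_block <= block <= self.last_block)


def prompt_schedule(num_steps, num_blocks, window, base="base", alt="alt"):
    """Build a num_steps x num_blocks grid naming the prompt used in each cell."""
    return [
        [alt if window.covers(step, block) else base for block in range(num_blocks)]
        for step in range(num_steps)
    ]


def coverage(schedule, alt="alt"):
    """Fraction of (step, block) cells conditioned on the alternative prompt."""
    cells = [cell for row in schedule for cell in row]
    return cells.count(alt) / len(cells)


if __name__ == "__main__":
    # Assumed toy setting: 10 denoising steps, 4 blocks, and an intervention
    # limited to the first 3 steps and the first 2 blocks.
    early = InterventionWindow(first_step=0, last_step=2, first_block=0, last_block=1)
    grid = prompt_schedule(num_steps=10, num_blocks=4, window=early)
    for step, row in enumerate(grid):
        print(step, row)
    print("fraction of cells intervened:", coverage(grid))
```

In a real T2V pipeline, each grid cell would select which text embedding the corresponding block attends to at that denoising step. How OpenSora or CogVideoX expose that hook is not described in the text above, so it is left out of the sketch.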