CV | Dec 6, 2024

Mind the Time: Temporally-Controlled Multi-Event Video Generation

arXiv:2412.05263v2 | 32 citations | h-index: 29 | Has Code | CVPR
Originality: Highly original
AI Analysis

This paper addresses the challenge of precise temporal control in video generation for applications that require multi-event sequences. It is a novel advance rather than an incremental improvement.

The paper tackles the problem of generating videos that contain multiple events in the correct temporal order, which existing methods fail to do. It introduces MinT, a model that binds each event to a specific time period and outperforms existing commercial and open-source models by a large margin.

Real-world videos consist of sequences of events. Generating such sequences with precise temporal control is infeasible with existing video generators that rely on a single paragraph of text as input. When tasked with generating multiple events described using a single prompt, such methods often ignore some of the events or fail to arrange them in the correct order. To address this limitation, we present MinT, a multi-event video generator with temporal control. Our key insight is to bind each event to a specific period in the generated video, which allows the model to focus on one event at a time. To enable time-aware interactions between event captions and video tokens, we design a time-based positional encoding method, dubbed ReRoPE. This encoding helps to guide the cross-attention operation. By fine-tuning a pre-trained video diffusion transformer on temporally grounded data, our approach produces coherent videos with smoothly connected events. For the first time in the literature, our model offers control over the timing of events in generated videos. Extensive experiments demonstrate that MinT outperforms existing commercial and open-source models by a large margin.
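The idea of binding each event to a time span through positional encoding can be illustrated with a toy example. The sketch below is not the paper's ReRoPE formulation, whose details are not given in the abstract. It uses a standard rotary encoding and makes one loud assumption: each caption is placed at the midpoint of its event span, and each video frame at its timestamp. All names (`rotate`, `event_position`, `best_event_per_frame`) and the example events are hypothetical.

```python
import math

def rotate(vec, pos, base=10000.0):
    """Apply a standard rotary positional encoding to `vec` (even length) at position `pos`."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        freq = base ** (-i / dim)          # one frequency per coordinate pair
        angle = pos * freq
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def event_position(start, end):
    # Assumption for illustration only: a caption sits at the midpoint of its event span.
    return 0.5 * (start + end)

# Hypothetical events: (caption, start time in seconds, end time in seconds).
events = [("a man raises his arms", 0.0, 2.0), ("he claps", 2.0, 4.0)]

# Frame timestamps at the centre of each 0.5 s slot: 0.25, 0.75, ..., 3.75.
frame_times = [0.25 + 0.5 * i for i in range(8)]

# Identical content features for queries and keys, so only position affects the logits.
feature = [1.0, 0.0] * 4

def best_event_per_frame(events, frame_times):
    """Return, for each frame, the index of the event with the highest attention logit."""
    keys = [rotate(feature, event_position(s, e)) for _, s, e in events]
    assignment = []
    for t in frame_times:
        query = rotate(feature, t)
        logits = [dot(query, k) for k in keys]
        assignment.append(logits.index(max(logits)))
    return assignment

print(best_event_per_frame(events, frame_times))
```

With content held equal, each frame's largest cross-attention logit goes to the event whose span contains it, because the rotary dot product depends only on the time difference between frame and caption. A real model would combine this positional bias with learned content features.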

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
