Optical-Flow Guided Prompt Optimization for Coherent Video Generation
This work addresses the challenge of generating temporally consistent videos for users of text-to-video models; it is an incremental improvement achieved through guidance techniques.
The paper tackles the problem of temporal inconsistency in text-to-video diffusion models by proposing MotionPrompt, a framework that uses optical flow guidance to optimize prompts during generation, resulting in videos with improved visual coherence and natural motion dynamics.
While text-to-video diffusion models have made significant strides, many still struggle to generate temporally consistent videos. Within diffusion frameworks, guidance techniques have proven effective in enhancing output quality during inference; however, applying these methods to video diffusion models introduces the additional complexity of handling computations across entire sequences. To address this, we propose a novel framework called MotionPrompt that guides the video generation process via optical flow. Specifically, we train a discriminator to distinguish the optical flow between random pairs of frames in real videos from the optical flow between random pairs of frames in generated videos. Given that prompts can influence the entire video, we optimize learnable token embeddings during reverse sampling steps using gradients from the trained discriminator applied to random frame pairs. This approach allows our method to generate visually coherent video sequences that closely reflect natural motion dynamics, without compromising the fidelity of the generated content. We demonstrate the effectiveness of our approach across various models.
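The optimization loop described above can be illustrated with a deliberately tiny, pure-Python sketch. It is not the authors' implementation: the "video model" is a linear map from a two-dimensional embedding to scalar frames, the "optical flow" is a per-step frame difference, and the "discriminator" is a fixed logistic score that favors small flows. Every name and constant here (generate_frames, toy_flow, discriminator, BASE, WEIGHTS, A, B) is a hypothetical stand-in chosen for illustration. Only the overall structure follows the abstract: sample a random frame pair, score its flow with a frozen discriminator, and update the learnable embedding along the resulting gradient.

```python
import math
import random

A, B = 2.0, 4.0  # hypothetical parameters of the frozen toy discriminator

def generate_frames(embedding, base, weights):
    """Toy stand-in for the video model: frame_t = base_t + <weights_t, embedding>."""
    return [b + sum(wk * ek for wk, ek in zip(w, embedding))
            for b, w in zip(base, weights)]

def toy_flow(frames, i, j):
    """Toy stand-in for optical flow: average per-step change from frame i to frame j."""
    return (frames[j] - frames[i]) / (j - i)

def discriminator(flow):
    """Probability that a flow value looks 'real'; small, smooth flows score high."""
    return 1.0 / (1.0 + math.exp(-(A - B * flow * flow)))

def pair_loss(embedding, base, weights, i, j):
    """Negative log-probability of 'real' for the flow between frames i and j."""
    flow = toy_flow(generate_frames(embedding, base, weights), i, j)
    return -math.log(discriminator(flow))

def pair_grad(embedding, base, weights, i, j):
    """Analytic gradient of pair_loss with respect to the embedding (chain rule)."""
    flow = toy_flow(generate_frames(embedding, base, weights), i, j)
    p = discriminator(flow)
    dloss_dflow = (1.0 - p) * 2.0 * B * flow
    return [dloss_dflow * (wj - wi) / (j - i)
            for wi, wj in zip(weights[i], weights[j])]

def mean_pair_loss(embedding, base, weights):
    """Average loss over all frame pairs, used only to monitor progress."""
    n = len(base)
    losses = [pair_loss(embedding, base, weights, i, j)
              for i in range(n) for j in range(i + 1, n)]
    return sum(losses) / len(losses)

def optimize_embedding(base, weights, steps=300, lr=0.01, seed=0):
    """Gradient descent on the embedding; each iteration stands in for one sampling step."""
    rng = random.Random(seed)
    embedding = [0.0] * len(weights[0])
    for _ in range(steps):
        i, j = sorted(rng.sample(range(len(base)), 2))  # random frame pair
        grad = pair_grad(embedding, base, weights, i, j)
        embedding = [e - lr * g for e, g in zip(embedding, grad)]
    return embedding

# Flickering toy "video": even and odd frames alternate between two brightness levels.
BASE = [0.0, 0.9, 0.1, 1.0, 0.2, 1.1]
# Hypothetical per-frame sensitivity of each frame to the two embedding dimensions.
WEIGHTS = [[(-1.0) ** t, 0.1 * t] for t in range(len(BASE))]

if __name__ == "__main__":
    learned = optimize_embedding(BASE, WEIGHTS)
    print("loss before:", mean_pair_loss([0.0, 0.0], BASE, WEIGHTS))
    print("loss after: ", mean_pair_loss(learned, BASE, WEIGHTS))
```

In a real setting the analytic gradient would presumably be replaced by automatic differentiation through an optical flow estimator and a learned discriminator, and the update would be interleaved with the denoising steps of the video diffusion model; those components are outside the scope of this sketch.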