Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator
This work addresses the problem of generating coherent videos from text for applications in content creation and AI, though the contribution is incremental in that it builds on existing LLM and LDM methods.
The paper tackles zero-shot text-to-video generation by proposing Free-Bloom, a pipeline that uses LLMs as directors to create semantically coherent prompt sequences and LDMs as animators to generate high-fidelity frames, producing vivid, high-quality videos without video data or training.
Text-to-video is a rapidly growing research area that aims to generate a semantically, identically, and temporally coherent sequence of frames that accurately aligns with the input text prompt. This study focuses on zero-shot text-to-video generation for its data and cost efficiency. To generate a semantically coherent video that exhibits a rich portrayal of temporal semantics, such as the whole process of a flower blooming rather than a set of "moving images", we propose a novel Free-Bloom pipeline that harnesses large language models (LLMs) as the director to generate a semantically coherent prompt sequence, and pre-trained latent diffusion models (LDMs) as the animator to generate high-fidelity frames. Furthermore, to ensure temporal and identical coherence while maintaining semantic coherence, we propose a series of annotative modifications that adapt LDMs in the reverse process, including joint noise sampling, step-aware attention shift, and dual-path interpolation. Without any video data or training requirements, Free-Bloom generates vivid and high-quality videos and is especially impressive in generating complex scenes with semantically meaningful frame sequences. In addition, Free-Bloom is naturally compatible with LDM-based extensions.
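The abstract names joint noise sampling without defining it. As a rough illustration only, one plausible reading is that each frame's initial latent noise mixes a component shared by all frames with a component drawn independently per frame, so that frames start from correlated but not identical noise. The sketch below assumes that reading; the function name `joint_noise`, the parameter `shared_weight`, and the square-root weighting (chosen so each entry keeps unit variance) are hypothetical and not taken from the paper.

```python
import math
import random


def joint_noise(num_frames, dim, shared_weight, seed=0):
    """Return one Gaussian noise vector per frame.

    Hypothetical sketch: each vector is sqrt(w) * shared + sqrt(1 - w) * own,
    where `shared` is drawn once for all frames and `own` once per frame.
    The weights keep every entry at unit variance, and the correlation
    between the same entry in two different frames equals w.
    """
    if not 0.0 <= shared_weight <= 1.0:
        raise ValueError("shared_weight must be in [0, 1]")
    rng = random.Random(seed)
    shared = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    a = math.sqrt(shared_weight)
    b = math.sqrt(1.0 - shared_weight)
    frames = []
    for _ in range(num_frames):
        own = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        frames.append([a * s + b * o for s, o in zip(shared, own)])
    return frames
```

Under this assumption, `shared_weight=1.0` gives every frame the same starting noise, `shared_weight=0.0` gives fully independent frames, and intermediate values trade frame-to-frame consistency against per-frame variation.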