CV Aug 24, 2024

Decoupled Video Generation with Chain of Training-free Diffusion Model Experts

Wenhao Li, Yichao Cao, Xiu Su, Xi Lin, Shan You, Mingkai Zheng, Yi Chen, Chang Xu

arXiv:2408.13423v42.0 h-index: 19

Originality: Highly original

AI Analysis

This work addresses efficiency and quality issues in video generation for applications such as filmmaking. It represents a strong incremental improvement built on a novel integration of existing methods.

The paper tackles the high computational cost and suboptimal results of video generation by proposing ConFiner, a framework that decouples video generation into subtasks handled by a chain of off-the-shelf diffusion model experts. With only 10% of the inference cost, ConFiner outperforms models such as Lavie and Modelscope, and its ConFiner-Long variant generates coherent videos of up to 600 frames.

Video generation models hold substantial potential in areas such as filmmaking. However, current video diffusion models incur high computational costs and produce suboptimal results due to the extreme complexity of the video generation task. In this paper, we propose ConFiner, an efficient video generation framework that decouples video generation into easier subtasks: structure control and spatial-temporal refinement. It can generate high-quality videos with a chain of off-the-shelf diffusion model experts, each responsible for one decoupled subtask. During refinement, we introduce coordinated denoising, which merges the capabilities of multiple diffusion experts into a single sampling process. Furthermore, we design the ConFiner-Long framework, which generates long, coherent videos by applying three constraint strategies to ConFiner. Experimental results indicate that, with only 10% of the inference cost, ConFiner surpasses representative models like Lavie and Modelscope across all objective and subjective metrics. ConFiner-Long can generate high-quality, coherent videos with up to 600 frames.
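The control flow described in the abstract (a control stage followed by a refinement stage in which several experts share one sampling loop) can be illustrated with a toy sketch. This is not the authors' implementation: the "experts" here are simple averaging functions rather than diffusion models, the alternating schedule is an assumption, and every name (`control_expert`, `spatial_expert`, `temporal_expert`, `coordinated_refine`, `generate`) is hypothetical.

```python
from typing import Callable, List

# Toy stand-in for a video latent: a list of frames, each a list of floats.
Video = List[List[float]]
Expert = Callable[[Video, float], Video]


def control_expert(num_frames: int, num_pixels: int) -> Video:
    """Hypothetical control stage: produce a coarse video structure."""
    return [[float(f + p) for p in range(num_pixels)] for f in range(num_frames)]


def spatial_expert(video: Video, strength: float) -> Video:
    """Hypothetical spatial step: pull each pixel toward its frame mean."""
    out = []
    for frame in video:
        mean = sum(frame) / len(frame)
        out.append([x + strength * (mean - x) for x in frame])
    return out


def temporal_expert(video: Video, strength: float) -> Video:
    """Hypothetical temporal step: pull each pixel toward its mean over time."""
    n = len(video)
    means = [sum(frame[p] for frame in video) / n for p in range(len(video[0]))]
    return [[x + strength * (means[p] - x) for p, x in enumerate(frame)]
            for frame in video]


def coordinated_refine(video: Video, experts: List[Expert], steps: int,
                       strength: float = 0.1) -> Video:
    """One loop in which the experts take turns (assumed schedule: round-robin)."""
    for t in range(steps):
        expert = experts[t % len(experts)]
        video = expert(video, strength)
    return video


def generate(num_frames: int = 4, num_pixels: int = 3, steps: int = 6) -> Video:
    """Chain the stages: coarse structure first, then coordinated refinement."""
    structure = control_expert(num_frames, num_pixels)
    return coordinated_refine(structure, [spatial_expert, temporal_expert], steps)
```

In this sketch the refinement experts never run as separate passes; each step of the single loop is handed to one expert, which is one plausible reading of "merging multiple experts' capabilities into a single sampling."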
