Tuning-Free Long Video Generation via Global-Local Collaborative Diffusion
This work addresses spatiotemporal inconsistency and high computational cost in long video generation, with applications in media and AI. It represents an incremental improvement over prior methods.
The paper tackles the problem of generating coherent, high-fidelity long videos by proposing GLC-Diffusion, a tuning-free method that integrates with existing video diffusion models to produce videos 3x and 6x longer with improved consistency and visual quality.
Generating high-fidelity, coherent long videos remains a sought-after goal. Although recent video diffusion models have shown promising potential, they still struggle with spatiotemporal inconsistencies and high computational demands. We propose GLC-Diffusion, a tuning-free method for long video generation. It models the long video denoising process by establishing denoising trajectories through Global-Local Collaborative Denoising, which ensures overall content consistency and temporal coherence between frames. We also introduce a Noise Reinitialization strategy that combines local noise shuffling with frequency fusion to improve global content consistency and visual diversity. Finally, we propose a Video Motion Consistency Refinement (VMCR) module that computes the gradient of pixel-wise and frequency-wise losses to enhance visual consistency and temporal smoothness. Extensive quantitative and qualitative experiments on videos of varying lengths (e.g., 3x and 6x longer) demonstrate that our method integrates effectively with existing video diffusion models and produces coherent, high-fidelity long videos superior to those of previous approaches.
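To make the Noise Reinitialization idea more concrete, the following is a minimal NumPy sketch of one possible reading of "local noise shuffling with frequency fusion". It is not the paper's implementation: the abstract does not specify the details, so the function names (`local_shuffle`, `frequency_fuse`, `reinitialize_noise`), the window size, the temporal cutoff, and the choice to keep low temporal frequencies from the shuffled noise and high frequencies from fresh Gaussian noise are all assumptions made for illustration.

```python
import numpy as np


def local_shuffle(base_noise, num_frames, window, rng):
    """Extend (T, C, H, W) noise to num_frames by repeating it and
    permuting frame order inside non-overlapping local windows (assumed scheme)."""
    T = base_noise.shape[0]
    chunks = []
    for _ in range(0, num_frames, T):
        chunk = base_noise.copy()
        for w in range(0, T, window):
            idx = np.arange(w, min(w + window, T))
            # Right-hand side is a copy, so the in-place assignment is safe.
            chunk[idx] = chunk[rng.permutation(idx)]
        chunks.append(chunk)
    return np.concatenate(chunks, axis=0)[:num_frames]


def frequency_fuse(low_src, high_src, cutoff):
    """Keep temporal frequencies <= cutoff from low_src and the rest from high_src."""
    T = low_src.shape[0]
    freqs = np.abs(np.fft.fftfreq(T))  # normalized frequency, range [0, 0.5]
    mask = (freqs <= cutoff).astype(np.float64).reshape(T, 1, 1, 1)
    fused = (np.fft.fft(low_src, axis=0) * mask
             + np.fft.fft(high_src, axis=0) * (1.0 - mask))
    # The mask is symmetric in frequency, so real inputs give a real result.
    return np.fft.ifft(fused, axis=0).real


def reinitialize_noise(base_noise, num_frames, window=4, cutoff=0.25, seed=0):
    """Hypothetical pipeline: shuffle locally, then fuse with fresh noise."""
    rng = np.random.default_rng(seed)
    shuffled = local_shuffle(base_noise, num_frames, window, rng)
    fresh = rng.standard_normal(shuffled.shape)
    return frequency_fuse(shuffled, fresh, cutoff)


# Example: extend 16 latent frames to 48 (3x longer).
base = np.random.default_rng(1).standard_normal((16, 4, 8, 8))
long_noise = reinitialize_noise(base, num_frames=48)
```

In this sketch, the shared low-frequency component is what would tie distant frames to the same global content, while the fresh high-frequency component is what would keep the extended frames from being exact repeats. Whether GLC-Diffusion fuses along the temporal axis, the spatial axes, or both is not stated in the abstract.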