CascadeV: An Implementation of Wurstchen Architecture for Video Generation
CascadeV addresses the high computational cost of text-to-video generation, a concern for both researchers and practitioners. It is an incremental improvement that works by cascading with existing models.
The paper tackles the computational cost of generating high-resolution videos with diffusion models by proposing CascadeV, a cascaded latent diffusion model that produces state-of-the-art 2K resolution videos and achieves a higher compression ratio, which reduces computational demands.
Following the tremendous success of diffusion models in text-to-image (T2I) generation, increasing attention has been directed toward their potential in text-to-video (T2V) applications. However, the computational demands of diffusion models pose significant challenges, particularly when generating high-resolution videos at high frame rates. In this paper, we propose CascadeV, a cascaded latent diffusion model (LDM) capable of producing state-of-the-art 2K resolution videos. Experiments demonstrate that our cascaded model achieves a higher compression ratio, substantially reducing the computational cost of high-quality video generation. We also implement a spatiotemporal alternating grid 3D attention mechanism, which effectively integrates spatial and temporal information and ensures superior consistency across the generated video frames. Furthermore, our model can be cascaded with existing T2V models, theoretically enabling a 4$\times$ increase in resolution or frames per second without any fine-tuning. Our code is available at https://github.com/bytedance/CascadeV.
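To give a rough sense of why a higher compression ratio lowers the cost of video diffusion, the sketch below counts latent positions for two spatial compression factors and compares the pairwise cost of full self-attention, which grows with the square of the token count. The frame size, frame count, and the factors 8 and 32 are assumptions chosen for illustration, not values reported for CascadeV, and the helper names are hypothetical.

```python
def latent_token_count(frames, height, width, spatial_factor, temporal_factor=1):
    """Count latent positions after compressing a video.

    The compression factors are hypothetical illustration values,
    not figures taken from the paper.
    """
    if frames % temporal_factor or height % spatial_factor or width % spatial_factor:
        raise ValueError("dimensions must be divisible by the compression factors")
    return (
        (frames // temporal_factor)
        * (height // spatial_factor)
        * (width // spatial_factor)
    )


def relative_attention_cost(tokens_a, tokens_b):
    """Ratio of full self-attention pairwise costs (quadratic in token count)."""
    return (tokens_a / tokens_b) ** 2


# Assumed example: 16 frames at 2048x1024, a 2K-class size picked for divisibility.
low = latent_token_count(16, 2048, 1024, spatial_factor=8)
high = latent_token_count(16, 2048, 1024, spatial_factor=32)
ratio = relative_attention_cost(low, high)
print(low, high, ratio)
```

Under these assumed numbers, quadrupling the spatial compression factor cuts the token count 16-fold, so a model that attends over all latent positions would perform far fewer pairwise comparisons. Actual savings depend on the attention pattern and architecture, which this sketch does not model.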