Hierarchical Patch Diffusion Models for High-Resolution Video Generation
This work addresses the scalability limitations of diffusion models for high-resolution video synthesis, which matters for applications in video generation and editing, though it builds incrementally on existing patch diffusion methods.
The paper tackles the challenge of scaling diffusion models to high-resolution video generation by proposing hierarchical patch diffusion models with deep context fusion and adaptive computation. It reports a state-of-the-art FVD of 66.32 and Inception Score of 87.68 on UCF-101, and enables end-to-end training at resolutions up to 64×288×512.
Diffusion models have demonstrated remarkable performance in image and video synthesis. However, scaling them to high-resolution inputs is challenging and requires restructuring the diffusion pipeline into multiple independent components, limiting scalability and complicating downstream applications. In this work, we study patch diffusion models (PDMs), a diffusion paradigm that models the distribution of patches rather than whole inputs. This makes it very efficient during training and unlocks end-to-end optimization on high-resolution videos. We improve PDMs in two principled ways. First, to enforce consistency between patches, we develop deep context fusion -- an architectural technique that propagates the context information from low-scale to high-scale patches in a hierarchical manner. Second, to accelerate training and inference, we propose adaptive computation, which allocates more network capacity and computation towards coarse image details. The resulting model sets a new state-of-the-art FVD score of 66.32 and Inception Score of 87.68 in class-conditional video generation on UCF-101 $256^2$, surpassing recent methods by more than 100%. Then, we show that it can be rapidly fine-tuned from a base $36\times 64$ low-resolution generator for high-resolution $64 \times 288 \times 512$ text-to-video synthesis. To the best of our knowledge, our model is the first diffusion-based architecture that is trained at such high resolutions entirely end-to-end. Project webpage: https://snap-research.github.io/hpdm.
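To make the idea of a patch hierarchy concrete, the following is a minimal NumPy sketch of how nested patches could be sampled from a video. It is an illustration under stated assumptions, not the authors' implementation: the factor-of-2 step between levels, the spatial-only cropping, the average-pool downsampling, and the names `downsample` and `sample_patch_hierarchy` are all hypothetical. With a $36 \times 64$ patch and four levels, the finest level matches the $288 \times 512$ frame size mentioned above.

```python
import numpy as np


def downsample(video, factor):
    """Average-pool the spatial axes of a (T, H, W) array by an integer factor."""
    t, h, w = video.shape
    return video.reshape(t, h // factor, factor, w // factor, factor).mean(axis=(2, 4))


def sample_patch_hierarchy(video, patch_hw, num_levels, rng):
    """Return one (origin, patch) pair per level, each nested in its parent.

    Level 0 is the whole video at the lowest resolution. Every later level
    doubles the resolution (an assumption of this sketch) and crops a
    patch_hw window inside the region covered by the parent patch.
    """
    ph, pw = patch_hw
    top, left = 0, 0  # patch origin in the current level's pixel grid
    patches = []
    for level in range(num_levels):
        factor = 2 ** (num_levels - 1 - level)
        scaled = downsample(video, factor)
        if level > 0:
            # The parent region spans (2*ph, 2*pw) pixels at this level;
            # pick a random patch_hw window that stays inside it.
            top = 2 * top + int(rng.integers(0, ph + 1))
            left = 2 * left + int(rng.integers(0, pw + 1))
        patch = scaled[:, top:top + ph, left:left + pw]
        patches.append(((top, left), patch))
    return patches


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.random((2, 288, 512))  # (frames, height, width)
    for level, (origin, patch) in enumerate(
        sample_patch_hierarchy(video, (36, 64), 4, rng)
    ):
        print(level, origin, patch.shape)
```

Every level yields a patch of the same size, so the per-step cost does not grow with the target resolution. In a setup like the one the abstract describes, the coarser patches would supply the context that the finer patches are conditioned on.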