CV | Mar 30, 2024

Grid Diffusion Models for Text-to-Video Generation

arXiv:2404.00234v2 | 23 citations | h-index: 3 | CVPR
Originality: Incremental advance
AI Analysis

This work addresses efficient and scalable text-to-video generation. It appears incremental, since it adapts existing diffusion models to video without major architectural changes.

The authors tackle text-to-video generation with a grid diffusion method that represents a video as a grid image, removing the need for a temporal dimension in the architecture and for a large text-video paired dataset. According to the authors, the approach generates high-quality video with a fixed amount of GPU memory regardless of the number of frames and outperforms existing methods in both quantitative and qualitative evaluations.

Recent advances in diffusion models have significantly improved text-to-image generation. However, generating videos from text is a more challenging task than generating images from text, due to the much larger dataset and higher computational cost required. Most existing video generation methods use either a 3D U-Net architecture that considers the temporal dimension or autoregressive generation. These methods require large datasets and are computationally expensive compared with text-to-image generation. To tackle these challenges, we propose a simple but effective grid diffusion method for text-to-video generation that requires neither a temporal dimension in the architecture nor a large text-video paired dataset. By representing the video as a grid image, we can generate a high-quality video using a fixed amount of GPU memory regardless of the number of frames. Additionally, since our method reduces the dimensions of the video to the dimensions of an image, various image-based methods can be applied to videos, such as adapting text-guided image manipulation to text-guided video manipulation. Our proposed method outperforms the existing methods in both quantitative and qualitative evaluations, demonstrating the suitability of our model for real-world video generation.
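
To make the "video as a grid image" idea concrete, here is a minimal sketch of tiling a short clip into a single image and recovering the frames afterwards. It illustrates the data layout only and is not the authors' implementation: the function names `frames_to_grid` and `grid_to_frames`, the row-major frame ordering, and the use of NumPy arrays shaped `(T, H, W, C)` are all assumptions made for this example.

```python
import numpy as np


def frames_to_grid(frames: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Tile frames of shape (T, H, W, C) into one image (rows*H, cols*W, C).

    Assumption for this sketch: frames are placed in row-major order,
    so frame i lands at grid cell (i // cols, i % cols).
    """
    t, h, w, c = frames.shape
    if t != rows * cols:
        raise ValueError(f"expected {rows * cols} frames, got {t}")
    # (T, H, W, C) -> (rows, cols, H, W, C) -> (rows, H, cols, W, C)
    grid = frames.reshape(rows, cols, h, w, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(rows * h, cols * w, c)


def grid_to_frames(grid: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Inverse of frames_to_grid: split a grid image back into frames."""
    gh, gw, c = grid.shape
    if gh % rows or gw % cols:
        raise ValueError("grid size is not divisible by rows/cols")
    h, w = gh // rows, gw // cols
    # (rows*H, cols*W, C) -> (rows, H, cols, W, C) -> (rows, cols, H, W, C)
    frames = grid.reshape(rows, h, cols, w, c).transpose(0, 2, 1, 3, 4)
    return frames.reshape(rows * cols, h, w, c)


# Hypothetical usage: four tiny 2x3 single-channel frames in a 2x2 grid.
clip = np.arange(4 * 2 * 3 * 1).reshape(4, 2, 3, 1)
grid_image = frames_to_grid(clip, rows=2, cols=2)
recovered = grid_to_frames(grid_image, rows=2, cols=2)
```

Because the result is an ordinary image array, any image-only model or editing routine can in principle consume it, which is the property the abstract relies on when it mentions applying image-based methods to video. How the paper keeps memory fixed as the number of frames grows is not shown here.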

