CV · Dec 14, 2024

Grid: Omni Visual Generation

Cong Wan, Xiangyang Luo, Hao Luo, Zijian Cai, Yiren Song, Yunlong Zhao, Yifan Bai, Fan Wang, Yuhang He, Yihong Gong

arXiv:2412.10718v512.19 citations · h-index: 14 · Has Code

Originality Highly original

AI Analysis

GRID provides an efficient and versatile solution for visual generation tasks such as text-to-video and 3D editing, addressing computational bottlenecks in the field.

The paper tackles the challenge of extending visual generation from single images to temporal sequences by introducing GRID, which reformulates sequences as grid layouts to leverage existing image models, achieving up to 67 times faster inference speeds and using less than 1/1000 of the computational resources compared to specialized models.

Visual generation has witnessed remarkable progress in single-image tasks, yet extending these capabilities to temporal sequences remains challenging. Current approaches either build specialized video models from scratch with enormous computational costs or add separate motion modules to image generators, both requiring learning temporal dynamics anew. We observe that modern image generation models possess underutilized potential in handling structured layouts with implicit temporal understanding. Building on this insight, we introduce GRID, which reformulates temporal sequences as grid layouts, enabling holistic processing of visual sequences while leveraging existing model capabilities. Through a parallel flow-matching training strategy with coarse-to-fine scheduling, our approach achieves up to 67× faster inference speeds while using <1/1000 of the computational resources compared to specialized models. Extensive experiments demonstrate that GRID not only excels in temporal tasks from Text-to-Video to 3D Editing but also preserves strong performance in image generation, establishing itself as an efficient and versatile omni-solution for visual generation.
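The core idea of reformulating a temporal sequence as a grid layout can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `frames_to_grid` and `grid_to_frames`, the row-major frame ordering, and the requirement that the frame count equal `rows * cols` are all assumptions made for illustration. The sketch only shows how T frames could be tiled into one image that an image model processes as a whole, and how the frames could be recovered afterwards.

```python
import numpy as np


def frames_to_grid(frames: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Tile frames of shape (T, H, W, C) into one (rows*H, cols*W, C) image.

    Assumption: frames fill the grid in row-major order (left to right,
    then top to bottom). The ordering used by GRID itself may differ.
    """
    t, h, w, c = frames.shape
    if t != rows * cols:
        raise ValueError(f"expected {rows * cols} frames, got {t}")
    # (rows, cols, H, W, C) -> (rows, H, cols, W, C) -> (rows*H, cols*W, C)
    grid = frames.reshape(rows, cols, h, w, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(rows * h, cols * w, c)


def grid_to_frames(grid: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Invert frames_to_grid: split a grid image back into (T, H, W, C)."""
    gh, gw, c = grid.shape
    if gh % rows or gw % cols:
        raise ValueError("grid size must be divisible by rows and cols")
    h, w = gh // rows, gw // cols
    # (rows, H, cols, W, C) -> (rows, cols, H, W, C) -> (T, H, W, C)
    frames = grid.reshape(rows, h, cols, w, c).transpose(0, 2, 1, 3, 4)
    return frames.reshape(rows * cols, h, w, c)


if __name__ == "__main__":
    # Four dummy 2x3 single-channel "frames" tiled into a 2x2 grid.
    clip = np.arange(4 * 2 * 3 * 1).reshape(4, 2, 3, 1)
    grid = frames_to_grid(clip, rows=2, cols=2)
    print(grid.shape)
    print(np.array_equal(grid_to_frames(grid, 2, 2), clip))
```

Under these assumptions, a single forward pass over the tiled image covers every frame at once, which is the sense in which the sequence is processed holistically rather than frame by frame.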
