CV, Dec 22, 2022

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, Mike Zheng Shou

Tencent

arXiv:2212.11565v254.21153 citations, h-index: 62, Has Code

Originality: Incremental advance

AI Analysis

This work addresses the problem of efficient video generation for AI and creative applications, though it is incremental in that it builds on existing text-to-image models.

The paper tackles the computational expense of training text-to-video generators by proposing Tune-A-Video, a method that tunes pre-trained text-to-image diffusion models using only one text-video pair, achieving promising results in generating consistent and motion-aware videos.

To replicate the success of text-to-image (T2I) generation, recent works employ large-scale video datasets to train a text-to-video (T2V) generator. Despite their promising results, such a paradigm is computationally expensive. In this work, we propose a new T2V generation setting, One-Shot Video Tuning, where only one text-video pair is presented. Our model is built on state-of-the-art T2I diffusion models pre-trained on massive image data. We make two key observations: 1) T2I models can generate still images that represent verb terms; 2) extending T2I models to generate multiple images concurrently exhibits surprisingly good content consistency. To further learn continuous motion, we introduce Tune-A-Video, which involves a tailored spatio-temporal attention mechanism and an efficient one-shot tuning strategy. At inference, we employ DDIM inversion to provide structure guidance for sampling. Extensive qualitative and numerical experiments demonstrate the remarkable ability of our method across various applications.
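The DDIM inversion step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`make_alphas`, `ddim_move`, `ddim_invert`, `ddim_sample`), the linear beta schedule, and the toy noise predictor are all assumptions made for illustration, and a real pipeline would call the tuned diffusion model's noise predictor on video latents instead. The sketch only shows the general idea: the deterministic DDIM update is run upward in noise level to map a clean latent to a noisy one, which can then serve as the starting point for sampling.

```python
import math

def make_alphas(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta) for a linear beta schedule (assumed)."""
    alphas, prod = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        alphas.append(prod)
    return alphas

def ddim_move(x, eps, a_from, a_to):
    """Deterministic DDIM update between two cumulative-alpha noise levels."""
    # Predict the clean latent implied by the current latent and noise estimate.
    x0 = [(xi - math.sqrt(1.0 - a_from) * ei) / math.sqrt(a_from)
          for xi, ei in zip(x, eps)]
    # Re-noise that prediction to the target level with the same noise estimate.
    return [math.sqrt(a_to) * x0i + math.sqrt(1.0 - a_to) * ei
            for x0i, ei in zip(x0, eps)]

def ddim_invert(x0, eps_model, alphas):
    """Map a clean latent to a noisy latent by stepping upward in noise level."""
    x, a_prev = list(x0), 1.0  # a cumulative alpha of 1.0 means "no noise"
    for t, a_t in enumerate(alphas):
        eps = eps_model(x, t)  # hypothetical noise predictor
        x = ddim_move(x, eps, a_prev, a_t)
        a_prev = a_t
    return x

def ddim_sample(x_T, eps_model, alphas):
    """Deterministic DDIM sampling, starting from a (possibly inverted) latent."""
    x = list(x_T)
    for t in range(len(alphas) - 1, -1, -1):
        a_to = alphas[t - 1] if t > 0 else 1.0
        x = ddim_move(x, eps_model(x, t), alphas[t], a_to)
    return x

# Toy predictor that depends only on the timestep (an assumption for the demo).
toy_eps = lambda x, t: [0.01 * (t + 1)] * len(x)

alphas = make_alphas(50)
latent = [0.5, -1.0, 2.0]
inverted = ddim_invert(latent, toy_eps, alphas)
recovered = ddim_sample(inverted, toy_eps, alphas)
```

With a predictor that depends only on the timestep, as in this toy setup, sampling from the inverted latent reproduces the original latent up to floating-point error. With a learned predictor that also depends on the latent, the round trip is only approximate, which is why the inverted latent acts as structure guidance rather than an exact reconstruction.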
