CV · Apr 3, 2023

Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos

Yue Ma, Yingqing He, Xiaodong Cun, Xintao Wang, Siran Chen, Ying Shan, Xiu Li, Qifeng Chen

arXiv:2304.01186v2 · 41.6 · 332 citations · h-index: 47 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the demand for creating digital human videos, though it is an incremental advance because it builds on existing text-to-image models.

The paper tackles the problem of generating text-editable and pose-controllable character videos by developing a two-stage training scheme that uses image-pose pairs and pose-free videos, achieving continuously pose-controllable videos while maintaining editing capabilities from pre-trained models.

Generating text-editable and pose-controllable character videos is in high demand for creating digital humans. Nevertheless, this task has been restricted by the absence of a comprehensive dataset of paired video-pose captions and of generative prior models for videos. In this work, we design a novel two-stage training scheme that uses easily obtained datasets (i.e., image-pose pairs and pose-free videos) and a pre-trained text-to-image (T2I) model to obtain pose-controllable character videos. Specifically, in the first stage, only keypoint-image pairs are used, for controllable text-to-image generation: we learn a zero-initialized convolutional encoder to encode the pose information. In the second stage, we finetune the motion of the above network on a pose-free video dataset by adding learnable temporal self-attention and reformed cross-frame self-attention blocks. Powered by these designs, our method successfully generates continuously pose-controllable character videos while keeping the editing and concept composition ability of the pre-trained T2I model. The code and models will be made publicly available.
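
The two ingredients named in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. It assumes the pose encoder is a single 1x1 convolution whose output is added to the pre-trained features as a residual, and that temporal self-attention uses identity query, key and value projections. All function names (`zero_conv1x1`, `inject_pose`, `temporal_self_attention`) are hypothetical. The point of zero initialization shown here is that the pose branch contributes nothing at the start of training, so the pre-trained T2I features pass through unchanged.

```python
import math


def zero_conv1x1(in_ch, out_ch):
    """Weights and bias of a 1x1 convolution, all initialized to zero."""
    return [[0.0] * in_ch for _ in range(out_ch)], [0.0] * out_ch


def apply_conv1x1(x, weight, bias):
    """Apply a 1x1 convolution to x, a list of pixels (each a list of channels)."""
    return [
        [sum(w * c for w, c in zip(row, pixel)) + b for row, b in zip(weight, bias)]
        for pixel in x
    ]


def inject_pose(features, pose, weight, bias):
    """Add the encoded pose map to the features as a residual (an assumption)."""
    encoded = apply_conv1x1(pose, weight, bias)
    return [[f + e for f, e in zip(fp, ep)] for fp, ep in zip(features, encoded)]


def temporal_self_attention(frames):
    """Self-attention across frames for one spatial location.

    frames: list of T feature vectors, one per video frame. Query, key and
    value projections are the identity here to keep the sketch short.
    """
    dim = len(frames[0])
    out = []
    for q in frames:
        # Scaled dot-product scores of this frame against every frame.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in frames]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        attn = [e / total for e in exps]
        # Attention-weighted average of the frame features.
        out.append([sum(a * v[d] for a, v in zip(attn, frames)) for d in range(dim)])
    return out


# Stage 1: a zero-initialized encoder leaves the pre-trained features untouched.
weight, bias = zero_conv1x1(in_ch=3, out_ch=2)
features = [[1.0, 2.0], [3.0, 4.0]]        # 2 pixels, 2 channels
pose = [[0.5, 0.1, 0.9], [0.2, 0.7, 0.3]]  # 2 pixels, 3 keypoint channels
unchanged = inject_pose(features, pose, weight, bias)

# Stage 2: temporal attention mixes information across the frame axis.
mixed = temporal_self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

In a real model the weights would move away from zero during stage-one training, and the temporal blocks would carry learned projections trained on the pose-free videos in stage two.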

