CV · Apr 3, 2023

Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos

arXiv:2304.01186v2 · 331 citations · h-index: 47
Originality: Incremental advance
AI Analysis

This work addresses the demand for digital human videos, though it is incremental in that it builds on existing text-to-image models.

The paper tackles the problem of generating text-editable and pose-controllable character videos by developing a two-stage training scheme that uses image-pose pairs and pose-free videos, achieving continuously pose-controllable videos while maintaining editing capabilities from pre-trained models.

Generating text-editable and pose-controllable character videos is in pressing demand for creating digital humans. Nevertheless, this task has been restricted by the absence of a comprehensive dataset featuring paired video-pose captions and of generative prior models for videos. In this work, we design a novel two-stage training scheme that can utilize easily obtained datasets (i.e., image-pose pairs and pose-free videos) and a pre-trained text-to-image (T2I) model to obtain pose-controllable character videos. Specifically, in the first stage, keypoint-image pairs are used only for controllable text-to-image generation. We learn a zero-initialized convolutional encoder to encode the pose information. In the second stage, we finetune the motion of the above network on a pose-free video dataset by adding learnable temporal self-attention and reformed cross-frame self-attention blocks. Powered by our new designs, our method successfully generates continuously pose-controllable character videos while keeping the editing and concept composition abilities of the pre-trained T2I model. The code and models will be made publicly available.
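
The two building blocks named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names are hypothetical, a 1x1 convolution (written as a channel-mixing matrix) stands in for the convolutional pose encoder, and the attention omits learned projections. It only shows why zero initialization leaves the pre-trained features untouched at the start of training, and how temporal self-attention mixes information across frames at each spatial position.

```python
import numpy as np

def zero_init_pose_encoder(pose_channels, feat_channels):
    """Weights of a 1x1 convolution, initialized to zero (hypothetical helper)."""
    return np.zeros((feat_channels, pose_channels))

def inject_pose(features, pose_map, weight):
    """Add the encoded pose map to the features as a residual.

    features: (C, H, W), pose_map: (P, H, W), weight: (C, P).
    """
    encoded = np.einsum("cp,phw->chw", weight, pose_map)
    return features + encoded

def temporal_self_attention(x):
    """Single-head self-attention across the frame axis (assumed simplification).

    x: (F, N, C) with F frames, N spatial positions, C channels.
    Each spatial position attends to the same position in every frame.
    """
    f, n, c = x.shape
    tokens = x.transpose(1, 0, 2)                              # (N, F, C)
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(c)   # (N, F, F)
    scores = scores - scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over frames
    out = weights @ tokens                                     # (N, F, C)
    return out.transpose(1, 0, 2)                              # (F, N, C)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(4, 8, 8))   # stand-in for pre-trained T2I features
    pose = rng.normal(size=(3, 8, 8))    # stand-in for a keypoint map
    w = zero_init_pose_encoder(pose_channels=3, feat_channels=4)
    # With zero weights the pose branch contributes nothing yet.
    print(np.allclose(inject_pose(feats, pose, w), feats))
    video = rng.normal(size=(6, 64, 4))  # 6 frames, 64 positions, 4 channels
    print(temporal_self_attention(video).shape)
```

Because the pose branch starts at zero, the first training step sees exactly the output of the pre-trained model, and the pose signal is introduced gradually as the weights move away from zero.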

Code Implementations: 2 repos
