CV · Mar 31, 2025

HumanDreamer: Generating Controllable Human-Motion Videos via Decoupled Generation

arXiv:2503.24026v2 · 22 citations · h-index: 14 · CVPR
Originality: Highly original
AI Analysis

This work addresses the problem of flexible human-motion video generation for applications in animation and virtual reality, representing a novel method for a known bottleneck.

The paper tackles the challenge of generating controllable human-motion videos by proposing HumanDreamer, a decoupled framework that first generates diverse poses from text prompts and then uses them to create videos, resulting in a 62.4% improvement in FID and enhancements in R-precision metrics.

Human-motion video generation is a challenging task, primarily because human body movements are difficult to learn. Some approaches drive human-centric video generation explicitly through pose control, but they typically rely on poses derived from existing videos and therefore lack flexibility. To address this, we propose HumanDreamer, a decoupled human video generation framework that first generates diverse poses from text prompts and then leverages these poses to generate human-motion videos. Specifically, we propose MotionVid, the largest dataset for human-motion pose generation. Based on this dataset, we present MotionDiT, which is trained to generate structured human-motion poses from text prompts. We also introduce a novel LAMA loss. Together, these contribute to a 62.4% improvement in FID, along with R-precision gains of 41.8%, 26.3%, and 18.3% for top-1, top-2, and top-3, respectively, advancing both Text-to-Pose control accuracy and FID. Our experiments across various Pose-to-Video baselines demonstrate that the poses generated by our method can produce diverse and high-quality human-motion videos. Furthermore, our model can facilitate other downstream tasks, such as pose sequence prediction and 2D-3D motion lifting.
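The abstract reports R-precision at top-1, top-2, and top-3 but does not describe the evaluation protocol. As a rough illustration only, the sketch below uses the common retrieval-style definition: for each text embedding, rank all candidate pose embeddings by Euclidean distance and check whether the paired pose lands in the top k. The function name `r_precision`, the toy embeddings, and the pairing-by-index convention are assumptions made for this example, not the paper's implementation. In practice the embeddings would come from pretrained text and pose encoders, which are not shown here.

```python
import math

def r_precision(text_emb, pose_emb, ks=(1, 2, 3)):
    """Top-k retrieval accuracy; text_emb[i] is assumed to pair with pose_emb[i]."""
    hits = {k: 0 for k in ks}
    for i, text in enumerate(text_emb):
        # Euclidean distance from this text to every candidate pose.
        dists = [math.dist(text, pose) for pose in pose_emb]
        # Rank of the paired pose = number of candidates strictly closer.
        rank = sum(d < dists[i] for d in dists)
        for k in ks:
            if rank < k:
                hits[k] += 1
    return {k: hits[k] / len(text_emb) for k in ks}

# Toy 2-D embeddings (hypothetical values, chosen only to exercise the function).
text_emb = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
pose_emb = [(0.1, 0.0), (0.0, 0.9), (0.9, 0.1)]
scores = r_precision(text_emb, pose_emb)
print(scores)
```

In this toy case only the first text retrieves its paired pose as the nearest neighbour, so top-1 and top-2 are both 1/3, while top-3 is trivially 1.0 because the pool has only three candidates. Real evaluations use a larger candidate pool so that top-3 is informative.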

