OpenDance: Multimodal Controllable 3D Dance Generation Using Large-scale Internet Data
This work addresses the challenge of generating controllable and diverse 3D dance animations for creative applications, representing an incremental advancement with a new dataset and method.
The paper tackles music-driven dance generation by addressing limitations in data and controllability. The result is OpenDanceNet, a framework that achieves high fidelity and flexible controllability in comprehensive experiments.
Music-driven dance generation offers significant creative potential yet faces considerable challenges. The absence of fine-grained multimodal data and the difficulty of flexible multi-conditional generation limit the controllability and diversity of previous works in practice. In this paper, we build OpenDance5D, an extensive human dance dataset comprising over 101 hours across 14 distinct genres. Each sample has five modalities to facilitate robust cross-modal learning: RGB video, audio, 2D keypoints, 3D motion, and fine-grained textual descriptions from human arts. Furthermore, we propose OpenDanceNet, a unified masked modeling framework for controllable dance generation conditioned on music and arbitrary combinations of text prompts, keypoints, or character positioning. Comprehensive experiments demonstrate that OpenDanceNet achieves high fidelity and flexible controllability.