GR CV MM SD Aug 23, 2025

MDD: A Dataset for Text-and-Music Conditioned Duet Dance Generation

arXiv:2508.16911v1 4 citations h-index: 7
Originality: Incremental advance
AI Analysis

MDD provides a new benchmark for generating duet dances from text and music, a specific multimodal generation problem in creative domains.

The authors introduce the Multimodal DuetDance (MDD) dataset, comprising 620 minutes of motion capture data with over 10K text descriptions, for text-and-music conditioned 3D duet dance generation. They also propose two novel tasks and report baseline evaluations on both to support future research.

We introduce Multimodal DuetDance (MDD), a diverse multimodal benchmark dataset designed for text-controlled and music-conditioned 3D duet dance motion generation. Our dataset comprises 620 minutes of high-quality motion capture data performed by professional dancers, synchronized with music, and annotated with over 10K fine-grained natural language descriptions. The annotations capture a rich movement vocabulary, detailing spatial relationships, body movements, and rhythm, making MDD the first dataset to seamlessly integrate human motions, music, and text for duet dance generation. We introduce two novel tasks supported by our dataset: (1) Text-to-Duet, where, given music and a textual prompt, both the leader's and the follower's dance motions are generated; and (2) Text-to-Dance Accompaniment, where, given music, a textual prompt, and the leader's motion, the follower's motion is generated in a cohesive, text-aligned manner. We include baseline evaluations on both tasks to support future research.
