MusicInfuser: Making Video Diffusion Listen and Dance
MusicInfuser addresses the challenge of generating music-aligned dance videos for content creators and entertainment applications, and is an incremental improvement over prior methods.
The paper tackles the problem of generating dance videos synchronized to music by adapting existing video diffusion models with lightweight cross-attention and low-rank adapters, achieving high-quality results without requiring motion capture data.
We introduce MusicInfuser, an approach for generating high-quality dance videos that are synchronized to a specified music track. Rather than attempting to design and train a new multimodal audio-video model, we show how existing video diffusion models can be adapted to align with musical inputs by introducing lightweight music-video cross-attention and a low-rank adapter. Unlike prior work requiring motion capture data, our approach fine-tunes only on dance videos. MusicInfuser achieves high-quality music-driven video generation while preserving the flexibility and generative capabilities of the underlying models. We introduce an evaluation framework using Video-LLMs to assess multiple dimensions of dance generation quality. The project page and code are available at https://susunghong.github.io/MusicInfuser.
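To make the two adaptation components named in the abstract concrete, the following is a minimal NumPy sketch of a music-video cross-attention layer and a low-rank adapter. It is an illustration under stated assumptions, not the authors' implementation: the class names `LoRALinear` and `MusicCrossAttention`, all dimensions, the single-head attention, and the zero initialization of the adapter's `b` matrix and the attention's output projection are choices made here for clarity. The zero initialization is one common way to ensure the adapted model starts out behaving exactly like the pretrained one.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class LoRALinear:
    """Frozen weight w plus a low-rank update (alpha / rank) * b @ a.

    Hypothetical helper for illustration; only a and b would be trained.
    """

    def __init__(self, w, rank, alpha, rng):
        d_out, d_in = w.shape
        self.w = w                                        # frozen, (d_out, d_in)
        self.a = rng.normal(0.0, 0.01, size=(rank, d_in)) # trainable down-projection
        self.b = np.zeros((d_out, rank))                  # trainable, zero init (assumption)
        self.scale = alpha / rank

    def __call__(self, x):
        # x: (tokens, d_in) -> (tokens, d_out)
        return x @ self.w.T + self.scale * (x @ self.a.T) @ self.b.T


class MusicCrossAttention:
    """Single-head cross-attention: video tokens query music tokens.

    Hypothetical layer for illustration; the output projection wo starts at
    zero (assumption), so the residual branch contributes nothing at first.
    """

    def __init__(self, d_video, d_music, d_attn, rng):
        self.wq = rng.normal(0.0, 0.02, size=(d_video, d_attn))
        self.wk = rng.normal(0.0, 0.02, size=(d_music, d_attn))
        self.wv = rng.normal(0.0, 0.02, size=(d_music, d_attn))
        self.wo = np.zeros((d_attn, d_video))

    def __call__(self, video, music):
        q = video @ self.wq                       # (n_video, d_attn)
        k = music @ self.wk                       # (n_music, d_attn)
        v = music @ self.wv                       # (n_music, d_attn)
        weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
        return video + (weights @ v) @ self.wo    # residual connection


# Toy inputs: 6 video tokens (dim 16) and 10 music tokens (dim 8).
rng = np.random.default_rng(0)
video_tokens = rng.normal(size=(6, 16))
music_tokens = rng.normal(size=(10, 8))

xattn = MusicCrossAttention(d_video=16, d_music=8, d_attn=32, rng=rng)
proj = LoRALinear(rng.normal(size=(16, 16)), rank=4, alpha=8.0, rng=rng)

fused = xattn(video_tokens, music_tokens)  # music-conditioned video tokens
out = proj(fused)                          # adapted linear projection
```

With both zero-initialized matrices untouched, `fused` equals `video_tokens` and `out` equals the frozen projection of those tokens, so fine-tuning starts from the pretrained model's behavior and only gradually lets the music signal influence the video tokens.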