SD AI CV MM AS · Jul 7, 2025

EXPOTION: Facial Expression and Motion Control for Multimodal Music Generation

arXiv:2507.04955v14.01 citations · h-index: 24 · ISMIR

Originality: Incremental advance

AI Analysis

This work targets multimodal music generation for interactive media and entertainment. It appears incremental, since it adapts an existing pretrained text-to-music model through parameter-efficient fine-tuning rather than introducing a new generation architecture.

The authors tackled the problem of generating music synchronized with visual cues by proposing Expotion, a model that uses facial expressions, upper-body motion, and text prompts to produce expressive music, achieving enhanced quality in musicality, creativity, and temporal alignment compared to existing methods.

We propose Expotion (Facial Expression and Motion Control for Multimodal Music Generation), a generative model leveraging multimodal visual controls - specifically, human facial expressions and upper-body motion - as well as text prompts to produce expressive and temporally accurate music. We adopt parameter-efficient fine-tuning (PEFT) on the pretrained text-to-music generation model, enabling fine-grained adaptation to the multimodal controls using a small dataset. To ensure precise synchronization between video and music, we introduce a temporal smoothing strategy to align multiple modalities. Experiments demonstrate that integrating visual features alongside textual descriptions enhances the overall quality of generated music in terms of musicality, creativity, beat-tempo consistency, temporal alignment with the video, and text adherence, surpassing both proposed baselines and existing state-of-the-art video-to-music generation models. Additionally, we introduce a novel dataset consisting of 7 hours of synchronized video recordings capturing expressive facial and upper-body gestures aligned with corresponding music, providing significant potential for future research in multimodal and interactive music generation.
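The abstract mentions a temporal smoothing strategy for aligning modalities but does not describe it. As a rough illustration of the general idea only, the sketch below smooths a per-frame visual feature (for example, one facial-landmark coordinate) with a centered moving average and then linearly resamples it to a different frame count, such as the frame rate of the music representation. The function names, the window size, and the choice of moving average plus linear interpolation are all assumptions made for this example, not the authors' method.

```python
def moving_average(values, window):
    """Centered moving average; the window shrinks near the sequence edges.

    Assumes `values` is a non-empty list of numbers (hypothetical helper).
    """
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def resample_linear(values, target_len):
    """Linearly interpolate a non-empty sequence to `target_len` samples."""
    if target_len == 1 or len(values) == 1:
        return [values[0]] * target_len
    scale = (len(values) - 1) / (target_len - 1)
    out = []
    for j in range(target_len):
        pos = j * scale                      # fractional index into `values`
        left = int(pos)
        right = min(left + 1, len(values) - 1)
        frac = pos - left
        out.append(values[left] * (1 - frac) + values[right] * frac)
    return out


def align_feature(values, window, target_len):
    """Smooth a per-frame feature, then resample it to the target length."""
    return resample_linear(moving_average(values, window), target_len)
```

For instance, `align_feature(track, window=5, target_len=n_music_frames)` would map one video-rate feature track onto `n_music_frames` steps, where both `track` and `n_music_frames` are placeholders for whatever the video and audio pipelines actually produce. Smoothing before resampling reduces frame-to-frame jitter in the landmark estimates so that it is not carried into the conditioning signal.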
