CV · May 28

SalsaAgent: A multimodal embodied language model for interactive dance generation

Payam Jome Yazdian, Zoe Stanley, Angelica Lim

arXiv:2605.29219 · 44.5 · h-index: 2

Predicted impact: top 75% in CV · last 90 days

Originality: Incremental advance

AI Analysis

This work addresses the challenge of bidirectional, nonverbal human-robot interaction for socially aware robots and virtual agents, specifically in dance, which requires real-time reactivity and synchrony.

SalsaAgent generates expressive, full-body salsa dance motions in reaction to a human leader and music, using a multimodal LLM with discrete motion tokens and a two-stage token-to-diffusion pipeline. It achieves significant improvements over baselines in motion quality, coordination, and spatial behavior.

Interaction between humanoids involves bidirectional and nonverbal reactivity, coordination and synchrony. Toward socially aware robots and interactive virtual agents, we present SalsaAgent, a language model that generates expressive, full-body salsa dance motions in reaction to a human leader and against a contextual music backdrop. We formulate interaction as nonverbal motion token passing, extending the vocabulary of a large language model (LLM) to process discrete motion tokens, pairwise relation tokens, and audio. Our contributions include new tokens for full-body and motion relations, LLM fine-tuning using automatically derived text descriptions of skeleton dynamics for token grounding, and a two-stage token-to-diffusion pipeline. Subjective and objective evaluations demonstrate the effectiveness of our approach in terms of motion quality, music and partner coordination, and consistent two-person spatial behavior, with significant improvements over baselines.
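
The vocabulary extension and motion token passing described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration, not the authors' implementation: the token names (`<motion_i>`, `<rel_i>`, `<leader>`, `<follower>`), the codebook sizes, and the helper functions `extend_vocab` and `encode_turn` are hypothetical, and audio input is omitted entirely.

```python
from typing import Dict, List

# Hypothetical codebook sizes; the abstract does not state them.
NUM_MOTION_TOKENS = 8
NUM_RELATION_TOKENS = 4


def extend_vocab(base_vocab: Dict[str, int]) -> Dict[str, int]:
    """Return a copy of base_vocab with motion and relation tokens appended."""
    vocab = dict(base_vocab)
    new_tokens = [f"<motion_{i}>" for i in range(NUM_MOTION_TOKENS)]
    new_tokens += [f"<rel_{i}>" for i in range(NUM_RELATION_TOKENS)]
    for tok in new_tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)  # next free id
    return vocab


def encode_turn(
    vocab: Dict[str, int],
    leader_codes: List[int],
    relation_codes: List[int],
) -> List[int]:
    """Build one prompt: leader motion, pairwise relations, then a follower cue."""
    tokens = ["<leader>"] + [f"<motion_{c}>" for c in leader_codes]
    tokens += [f"<rel_{c}>" for c in relation_codes]
    tokens.append("<follower>")  # the model would continue with follower motion tokens
    return [vocab[t] for t in tokens]


# Toy text vocabulary standing in for an LLM tokenizer (assumed, not from the paper).
base_vocab = {"<pad>": 0, "<leader>": 1, "<follower>": 2}
vocab = extend_vocab(base_vocab)
prompt_ids = encode_turn(vocab, leader_codes=[2, 5], relation_codes=[1])
```

In a real system the new token ids would also need matching rows in the model's embedding table, and the follower's predicted motion tokens would be decoded back into continuous motion, which the abstract attributes to a two-stage token-to-diffusion pipeline.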
