CV | May 28, 2025

Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation

arXiv:2505.22647v1 | 53 citations | h-index: 21 | Has Code
Originality: Highly original
AI Analysis

This work addresses the challenge of generating realistic multi-person conversational videos for applications in virtual communication and entertainment. It defines a novel task rather than offering an incremental improvement.

The paper tackles the problem of generating synchronized multi-person conversational videos from audio inputs, addressing issues like incorrect audio-person binding and limited instruction-following. It introduces the MultiTalk framework, which outperforms other methods on talking head, talking body, and multi-person benchmarks.

Audio-driven human animation methods, such as talking head and talking body generation, have made remarkable progress in generating synchronized facial movements and visually appealing videos. However, existing methods primarily focus on single-person animation and struggle with multi-stream audio inputs, where audio is bound to the wrong person. They also show limited instruction-following capabilities. To address these problems, we propose a novel task, Multi-Person Conversational Video Generation, and introduce a new framework, MultiTalk, to handle the challenges of multi-person generation. Specifically, for audio injection, we investigate several schemes and propose the Label Rotary Position Embedding (L-RoPE) method to resolve the audio-person binding problem. Furthermore, during training, we observe that partial parameter training and multi-task training are crucial for preserving the instruction-following ability of the base model. MultiTalk outperforms other methods on several datasets, including talking head, talking body, and multi-person datasets, demonstrating the generation capabilities of our approach.
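
The abstract names L-RoPE but does not spell out its mechanics, so the following is only a rough illustration of the general idea, not the paper's implementation. It assumes that each video token and each audio token carries a per-person label, and that the label is used as the position in a standard rotary embedding applied to queries and keys before cross-attention. The function names (`rope_rotate`, `label_scores`) and the label values (0 and 20) are hypothetical and chosen for illustration.

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Apply a standard 1D rotary embedding to x, shape (n, d) with d even.

    `positions` holds one scalar per row. In this sketch the scalar is a
    person label rather than a temporal index (an assumption).
    """
    x = np.asarray(x, dtype=float)
    positions = np.asarray(positions, dtype=float)
    half = x.shape[1] // 2
    inv_freq = base ** (-np.arange(half) / half)      # (half,)
    angles = positions[:, None] * inv_freq[None, :]   # (n, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1[i], x2[i]) pair by its angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)

def label_scores(queries, keys, query_labels, key_labels):
    """Raw attention scores after rotating queries and keys by their labels."""
    q = rope_rotate(queries, query_labels)
    k = rope_rotate(keys, key_labels)
    return q @ k.T

if __name__ == "__main__":
    # One video token for person A (hypothetical label 0) attends to two
    # audio tokens with identical content but different labels.
    video = np.ones((1, 8))
    audio = np.ones((2, 8))
    scores = label_scores(video, audio, [0.0], [0.0, 20.0])
    print("score with matching label:   ", scores[0, 0])
    print("score with mismatched label: ", scores[0, 1])
```

Because rotary embeddings make the query-key dot product depend only on the difference between positions, identical content scores highest when the two labels match. This is one plausible way a label-based rotation could steer each person's video tokens toward that person's audio stream.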

Code Implementations: 1 repo
