CV | Mar 30, 2025

MoCha: Towards Movie-Grade Talking Character Synthesis

arXiv:2503.23307v1 | 28 citations | h-index: 16
Originality: Incremental advance
AI Analysis

The work addresses the need for automated film and animation generation by enabling multi-character conversations with cinematic coherence, though its advance over existing video generation for character-driven storytelling is incremental.

The paper tackles the problem of generating full-body talking character animations from speech and text, achieving superior realism, expressiveness, controllability, and generalization as demonstrated through human preference studies and benchmark comparisons.

Recent advancements in video generation have achieved impressive motion realism, yet they often overlook character-driven storytelling, a crucial task for automated film and animation generation. We introduce Talking Characters, a more realistic task to generate talking character animations directly from speech and text. Unlike talking-head generation, Talking Characters aims at generating the full portrait of one or more characters beyond the facial region. In this paper, we propose MoCha, the first model of its kind to generate talking characters. To ensure precise synchronization between video and speech, we propose a speech-video window attention mechanism that effectively aligns speech and video tokens. To address the scarcity of large-scale speech-labeled video datasets, we introduce a joint training strategy that leverages both speech-labeled and text-labeled video data, significantly improving generalization across diverse character actions. We also design structured prompt templates with character tags, enabling, for the first time, multi-character conversation with turn-based dialogue, allowing AI-generated characters to engage in context-aware conversations with cinematic coherence. Extensive qualitative and quantitative evaluations, including human preference studies and benchmark comparisons, demonstrate that MoCha sets a new standard for AI-generated cinematic storytelling, achieving superior realism, expressiveness, controllability, and generalization.
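
The abstract does not spell out how the speech-video window attention is built, so the following is only a generic sketch of a local-window cross-attention mask, not the authors' implementation. It assumes that video and speech tokens are spread uniformly over the same clip duration and that each video token may attend only to speech tokens within a fixed radius of its aligned position. The function name `window_attention_mask` and the `radius` parameter are hypothetical.

```python
def window_attention_mask(num_video_tokens, num_speech_tokens, radius=1):
    """Build mask[v][s], True where video token v may attend to speech token s.

    Assumption (not from the paper): both token streams cover the same clip
    uniformly, so video token v aligns with speech index
    floor(v * num_speech_tokens / num_video_tokens).
    """
    mask = []
    for v in range(num_video_tokens):
        # Speech position aligned with this video token.
        center = (v * num_speech_tokens) // num_video_tokens
        # Clamp the window to the valid speech index range.
        lo = max(0, center - radius)
        hi = min(num_speech_tokens - 1, center + radius)
        mask.append([lo <= s <= hi for s in range(num_speech_tokens)])
    return mask


# Illustrative sizes only: 4 video tokens attending over 8 speech tokens.
mask = window_attention_mask(num_video_tokens=4, num_speech_tokens=8, radius=1)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In an attention layer, positions where the mask is False would typically have their scores set to negative infinity before the softmax, so each video token draws only on temporally nearby speech. That locality is one plausible way such a window could help lip and body motion stay synchronized with the audio.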

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
