SD AI CV MM AS
Jan 4

MM-Sonate: Multimodal Controllable Audio-Video Generation with Zero-Shot Voice Cloning

Chunyu Qiang, Jun Wang, Xiaopeng Wang, Kang Yin, Yuxin Guo, Xijuan Zeng, Nan Li, Zihan Li, Yuzhe Liang, Ziyu Zhang, Teng Ma, Yushen Chen

arXiv:2601.01568v1
8.14 citations

Originality: Highly original

AI Analysis

This work addresses the challenge of synchronized multisensory content creation for applications such as virtual assistants and entertainment, offering incremental improvements in control and fidelity.

The paper tackles joint audio-video generation with fine-grained acoustic control, particularly for identity-preserving speech. It reports new state-of-the-art performance on joint generation benchmarks, with significant gains in lip synchronization and speech intelligibility, and voice cloning fidelity comparable to specialized Text-to-Speech systems.

Joint audio-video generation aims to synthesize synchronized multisensory content, yet current unified models struggle with fine-grained acoustic control, particularly for identity-preserving speech. Existing approaches either suffer from temporal misalignment due to cascaded generation or lack the capability to perform zero-shot voice cloning within a joint synthesis framework. In this work, we present MM-Sonate, a multimodal flow-matching framework that unifies controllable audio-video joint generation with zero-shot voice cloning capabilities. Unlike prior works that rely on coarse semantic descriptions, MM-Sonate utilizes a unified instruction-phoneme input to enforce strict linguistic and temporal alignment. To enable zero-shot voice cloning, we introduce a timbre injection mechanism that effectively decouples speaker identity from linguistic content. Furthermore, addressing the limitations of standard classifier-free guidance in multimodal settings, we propose a noise-based negative conditioning strategy that utilizes natural noise priors to significantly enhance acoustic fidelity. Empirical evaluations demonstrate that MM-Sonate establishes new state-of-the-art performance in joint generation benchmarks, significantly outperforming baselines in lip synchronization and speech intelligibility, while achieving voice cloning fidelity comparable to specialized Text-to-Speech systems.
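
The abstract does not spell out how the noise-based negative conditioning works, so the following is only a generic sketch of classifier-free guidance in which the usual empty condition is swapped for a negative condition. The names `guided_velocity` and `toy_model`, and the constant vector standing in for a noise-derived condition, are hypothetical and are not taken from the paper.

```python
import numpy as np


def guided_velocity(v_cond, v_neg, scale):
    """Classifier-free guidance: start from the negative-branch prediction
    and extrapolate toward the conditional prediction by `scale`."""
    return v_neg + scale * (v_cond - v_neg)


def toy_model(x, cond):
    # Hypothetical stand-in for a flow-matching network: the predicted
    # velocity simply points from the current state toward the condition.
    return cond - x


x = np.zeros(4)            # current sample state
cond = np.ones(4)          # assumed embedding of the desired condition
neg = np.full(4, 0.2)      # assumed stand-in for a noise-derived negative condition

v = guided_velocity(toy_model(x, cond), toy_model(x, neg), scale=2.0)
```

With `scale=1.0` the result reduces to the plain conditional prediction; larger values push the sample further away from whatever the negative branch represents.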
