MM-Sonate: Multimodal Controllable Audio-Video Generation with Zero-Shot Voice Cloning

arXiv:2601.01568v1 · 4 citations
Originality: Highly original
AI Analysis

The work addresses the challenge of synchronized multisensory content creation for applications such as virtual assistants and entertainment, offering incremental improvements in control and fidelity.

The paper tackles joint audio-video generation with fine-grained acoustic control, particularly for identity-preserving speech. It reports new state-of-the-art performance on joint generation benchmarks, with significantly better lip synchronization and speech intelligibility than baselines, and voice cloning fidelity comparable to specialized Text-to-Speech systems.

Joint audio-video generation aims to synthesize synchronized multisensory content, yet current unified models struggle with fine-grained acoustic control, particularly for identity-preserving speech. Existing approaches either suffer from temporal misalignment due to cascaded generation or lack the capability to perform zero-shot voice cloning within a joint synthesis framework. In this work, we present MM-Sonate, a multimodal flow-matching framework that unifies controllable audio-video joint generation with zero-shot voice cloning capabilities. Unlike prior works that rely on coarse semantic descriptions, MM-Sonate utilizes a unified instruction-phoneme input to enforce strict linguistic and temporal alignment. To enable zero-shot voice cloning, we introduce a timbre injection mechanism that effectively decouples speaker identity from linguistic content. Furthermore, addressing the limitations of standard classifier-free guidance in multimodal settings, we propose a noise-based negative conditioning strategy that utilizes natural noise priors to significantly enhance acoustic fidelity. Empirical evaluations demonstrate that MM-Sonate establishes new state-of-the-art performance in joint generation benchmarks, significantly outperforming baselines in lip synchronization and speech intelligibility, while achieving voice cloning fidelity comparable to specialized Text-to-Speech systems.
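The abstract mentions replacing standard classifier-free guidance with a negative branch conditioned on noise. As a rough illustration only, the sketch below shows the generic guidance rule for a flow-matching sampler, where the negative branch receives an arbitrary negative condition instead of an empty one. Everything here is an assumption made for illustration: the names `guided_velocity`, `euler_sample`, and `toy_velocity`, the list-of-floats representation, and the idea that the negative condition is an embedding of a natural-noise clip. The paper's actual model, conditioning format, and sampler are not described in this summary.

```python
from typing import Callable, List

Vector = List[float]
# Hypothetical signature: velocity_fn(x, t, cond) -> predicted velocity.
VelocityFn = Callable[[Vector, float, Vector], Vector]


def guided_velocity(velocity_fn: VelocityFn, x: Vector, t: float,
                    cond: Vector, negative_cond: Vector,
                    scale: float) -> Vector:
    """Generic guidance rule: v_neg + scale * (v_pos - v_neg).

    In standard classifier-free guidance, negative_cond is a null
    (empty) condition. Here it is left as a free input, so it could be
    a noise-derived embedding (an assumption, not the paper's design).
    """
    v_pos = velocity_fn(x, t, cond)
    v_neg = velocity_fn(x, t, negative_cond)
    return [n + scale * (p - n) for p, n in zip(v_pos, v_neg)]


def euler_sample(velocity_fn: VelocityFn, x0: Vector, cond: Vector,
                 negative_cond: Vector, scale: float,
                 steps: int) -> Vector:
    """Integrate the guided velocity from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = guided_velocity(velocity_fn, x, t, cond, negative_cond, scale)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x: Vector, t: float, cond: Vector) -> Vector:
    """Stand-in for a trained network: points from x toward cond."""
    return [c - xi for c, xi in zip(cond, x)]


if __name__ == "__main__":
    target = [1.0, 2.0]       # stands in for the desired condition
    noise_like = [0.5, 0.5]   # stands in for a noise-derived condition
    print(euler_sample(toy_velocity, [0.0, 0.0], target, noise_like,
                       scale=2.0, steps=4))
```

With `scale` set to 1.0 the rule reduces to the positively conditioned velocity, and larger values push the sample further away from whatever the negative condition represents.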

