ASCL · Jul 12, 2025

ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching

arXiv:2507.09318v1 · 8 citations · h-index: 12 · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of generating realistic spoken dialogue for applications such as virtual assistants. The advance is incremental: it builds on flow matching and adds dialogue-specific enhancements.

The paper tackles spoken dialogue generation by introducing ZipVoice-Dialog, a non-autoregressive model designed for faster and more stable inference. It reports better intelligibility, speaker turn-taking accuracy, speaker similarity, and inference speed than existing models.

Generating spoken dialogue is more challenging than monologue text-to-speech (TTS) due to the need for realistic turn-taking and distinct speaker timbres. Existing spoken dialogue generation models, being autoregressive, suffer from slow and unstable inference. To overcome these limitations, we introduce ZipVoice-Dialog, a non-autoregressive zero-shot spoken dialogue generation model built upon flow matching. Key designs include: 1) speaker-turn embeddings for precise speaker turn-taking; 2) a curriculum learning strategy for stable speech-text alignment; 3) specialized strategies to enable stereo dialogue generation. Additionally, recognizing the lack of open-source large-scale spoken dialogue datasets, we curated OpenDialog, a 6.8k-hour spoken dialogue dataset from in-the-wild speech data. Furthermore, we established a benchmark to comprehensively evaluate various models. Experimental results demonstrate that ZipVoice-Dialog achieves superior performance in intelligibility, speaker turn-taking accuracy, speaker similarity, and inference speed. Our code, model checkpoints, demo samples, and the OpenDialog dataset are all publicly available at https://github.com/k2-fsa/ZipVoice.
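The abstract names two ingredients that are easy to picture in code: speaker-turn embeddings and flow matching. The sketch below is a minimal NumPy illustration of the general ideas, not the ZipVoice-Dialog implementation. The function names, the two-speaker table, the straight-line interpolation path, and the Euler sampler are all assumptions chosen for illustration; the paper's actual architecture and training recipe are not described in this summary.

```python
import numpy as np


def add_speaker_turn_embeddings(token_emb, speaker_ids, turn_table):
    """Add a per-speaker vector to every text token embedding.

    token_emb:   (num_tokens, dim) text token embeddings
    speaker_ids: (num_tokens,) integer speaker index for each token
    turn_table:  (num_speakers, dim) one vector per speaker
                 (hypothetical; would be learned in a real model)
    """
    return token_emb + turn_table[speaker_ids]


def flow_matching_pair(x1, t, rng):
    """Build one training pair on a straight noise-to-data path (assumed).

    x1: (frames, feat) target speech features
    t:  scalar in [0, 1]
    Returns the interpolated point x_t and the regression target x1 - x0.
    """
    x0 = rng.standard_normal(x1.shape)   # noise sample
    x_t = (1.0 - t) * x0 + t * x1        # point on the straight path
    velocity = x1 - x0                   # constant velocity along that path
    return x_t, velocity


def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps.

    Every frame is updated in parallel at each step, which is what makes
    this style of sampling non-autoregressive.
    """
    x = x0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Five text tokens: speaker 0 says two, speaker 1 says two, speaker 0 one.
    token_emb = rng.standard_normal((5, 8))
    speaker_ids = np.array([0, 0, 1, 1, 0])
    turn_table = rng.standard_normal((2, 8))
    conditioned = add_speaker_turn_embeddings(token_emb, speaker_ids, turn_table)
    print(conditioned.shape)

    # One flow matching training pair for a toy 20-frame, 4-dim target.
    x1 = rng.standard_normal((20, 4))
    x_t, target = flow_matching_pair(x1, t=0.3, rng=rng)
    print(x_t.shape, target.shape)
```

In a trained system, `velocity_fn` would be a neural network conditioned on the speaker-tagged text; here it is left as a parameter so the sampler can be exercised with any function of `(x, t)`.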

Code Implementations: 1 repo