CL · AI · SD · AS · Aug 26, 2025

VibeVoice Technical Report

Tsinghua
arXiv:2508.19205v1 · 26 citations · h-index: 41 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the problem of generating authentic, long-form conversational speech for applications such as audiobooks or virtual assistants. It appears incremental, since it builds on existing diffusion and tokenization methods.

The paper tackles long-form multi-speaker speech synthesis by introducing VibeVoice, which uses a novel continuous speech tokenizer to achieve 80x better data compression than Encodec while maintaining audio fidelity, enabling synthesis of up to 90 minutes of speech with up to 4 speakers.

This report presents VibeVoice, a novel model designed to synthesize long-form speech with multiple speakers by employing next-token diffusion, which is a unified method for modeling continuous data by autoregressively generating latent vectors via diffusion. To enable this, we introduce a novel continuous speech tokenizer that, when compared to the popular Encodec model, improves data compression by 80 times while maintaining comparable performance. The tokenizer effectively preserves audio fidelity while significantly boosting computational efficiency for processing long sequences. Thus, VibeVoice can synthesize long-form speech for up to 90 minutes (in a 64K context window length) with a maximum of 4 speakers, capturing the authentic conversational "vibe" and surpassing open-source and proprietary dialogue models.
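The figures in the abstract imply a tight token budget, which a short back-of-the-envelope calculation makes concrete. The sketch below is illustrative arithmetic only, not the authors' method. It assumes that "64K" means 65,536 tokens, that the whole window could be spent on speech (in practice text and speaker prompts share it, so the real rate must be lower), and that the 80x compression gain translates directly into an 80x shorter token sequence. The function names are hypothetical.

```python
# Token budget implied by the abstract's figures (illustrative arithmetic only).

CONTEXT_TOKENS = 64 * 1024   # "64K context window", assumed to mean 65,536 tokens
MAX_MINUTES = 90             # longest synthesis length quoted in the abstract
COMPRESSION_GAIN = 80        # claimed compression improvement over Encodec


def max_tokens_per_second(context_tokens: int, minutes: float) -> float:
    """Upper bound on the speech token rate if the entire window held speech."""
    return context_tokens / (minutes * 60)


def tokens_without_gain(context_tokens: int, gain: int) -> int:
    """Sequence length for the same audio if each token covered `gain` times
    less audio (assumes the gain maps one-to-one onto token count)."""
    return context_tokens * gain


if __name__ == "__main__":
    rate = max_tokens_per_second(CONTEXT_TOKENS, MAX_MINUTES)
    print(f"Speech token rate must stay below about {rate:.1f} tokens per second")
    longer = tokens_without_gain(CONTEXT_TOKENS, COMPRESSION_GAIN)
    print(f"Without the gain, the same audio would need about {longer:,} tokens")
```

Under these assumptions, 90 minutes of audio fits in the window only if the tokenizer emits roughly a dozen tokens per second or fewer, which is why the compression claim and the long-form claim go together.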

