AS | AI | CL | LG | SD | Mar 5, 2024

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

arXiv:2403.03100v3 | 356 citations | h-index: 26 | ICML
Originality: Highly original
AI Analysis

This work targets natural, expressive speech synthesis for applications such as virtual assistants and audiobooks, and it represents a significant advance rather than an incremental improvement.

The paper tackles the challenge of generating high-quality, natural-sounding speech in text-to-speech systems by factorizing speech into distinct attributes like content, prosody, and timbre, and using diffusion models to generate them individually. The result is NaturalSpeech 3, which outperforms state-of-the-art systems on quality, similarity, prosody, and intelligibility, achieving on-par quality with human recordings and scaling to 1B parameters and 200K hours of training data.

While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Considering speech intricately encompasses various attributes (e.g., content, prosody, timbre, and acoustic details) that pose significant challenges for generation, a natural idea is to factorize speech into individual subspaces representing different attributes and generate them individually. Motivated by it, we propose NaturalSpeech 3, a TTS system with novel factorized diffusion models to generate natural speech in a zero-shot way. Specifically, 1) we design a neural codec with factorized vector quantization (FVQ) to disentangle speech waveform into subspaces of content, prosody, timbre, and acoustic details; 2) we propose a factorized diffusion model to generate attributes in each subspace following its corresponding prompt. With this factorization design, NaturalSpeech 3 can effectively and efficiently model intricate speech with disentangled subspaces in a divide-and-conquer way. Experiments show that NaturalSpeech 3 outperforms the state-of-the-art TTS systems on quality, similarity, prosody, and intelligibility, and achieves on-par quality with human recordings. Furthermore, we achieve better performance by scaling to 1B parameters and 200K hours of training data.
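To make the factorization idea in the abstract concrete, here is a minimal sketch of quantizing separate attribute subspaces with separate codebooks. It is an illustration under stated assumptions, not the paper's codec: the fixed slicing of a frame latent into four ranges, the random codebooks standing in for learned ones, the codebook size, and every name (`SUBSPACES`, `make_codebooks`, `factorized_quantize`, `dequantize`) are hypothetical and not taken from NaturalSpeech 3.

```python
import random

# Hypothetical layout: which slice of a frame latent belongs to which attribute.
# A real codec would learn this separation; fixed slicing is an assumption here.
SUBSPACES = {
    "content": (0, 2),
    "prosody": (2, 4),
    "timbre": (4, 6),
    "acoustic_details": (6, 8),
}


def make_codebooks(num_codes=4, seed=0):
    """Build one random codebook per subspace (stand-in for learned codebooks)."""
    rng = random.Random(seed)
    books = {}
    for name, (start, end) in SUBSPACES.items():
        dim = end - start
        books[name] = [
            [rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for _ in range(num_codes)
        ]
    return books


def nearest_code(vector, codebook):
    """Return the index of the codeword closest to `vector` (squared L2)."""
    def sq_dist(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def factorized_quantize(latent, codebooks):
    """Quantize each attribute slice of `latent` with its own codebook."""
    indices = {}
    for name, (start, end) in SUBSPACES.items():
        indices[name] = nearest_code(latent[start:end], codebooks[name])
    return indices


def dequantize(indices, codebooks):
    """Rebuild a latent by concatenating the selected codeword per subspace."""
    latent = []
    for name in SUBSPACES:
        latent.extend(codebooks[name][indices[name]])
    return latent


codebooks = make_codebooks()
frame = [0.1, -0.3, 0.8, 0.2, -0.5, 0.4, 0.0, 0.9]  # made-up 8-dim frame latent
codes = factorized_quantize(frame, codebooks)        # one discrete index per attribute
reconstruction = dequantize(codes, codebooks)
```

Each attribute ends up with its own discrete index, so a downstream generator can, in principle, model the subspaces one at a time rather than a single entangled code.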

Code Implementations: 1 repo
