AS AI HC LG SD | Apr 23, 2023

DiffVoice: Text-to-Speech with Latent Diffusion

arXiv:2304.11750v1 | 30 citations | h-index: 68
Originality: Highly original
AI Analysis

This work addresses the generation of high-quality, natural-sounding speech for text-to-speech (TTS) synthesis and text-based speech editing. The analysis rates it as a significant advance rather than an incremental improvement.

The authors developed DiffVoice, a text-to-speech model based on latent diffusion. It encodes speech into phoneme-rate latents and jointly models duration and the latent representation. In subjective evaluations on LJSpeech and LibriTTS, it outperforms the best publicly available systems in naturalness, and it achieves state-of-the-art performance in text-based speech editing and zero-shot adaptation.

In this work, we present DiffVoice, a novel text-to-speech model based on latent diffusion. We propose to first encode speech signals into a phoneme-rate latent representation with a variational autoencoder enhanced by adversarial training, and then jointly model the duration and the latent representation with a diffusion model. Subjective evaluations on LJSpeech and LibriTTS datasets demonstrate that our method beats the best publicly available systems in naturalness. By adopting recent generative inverse problem solving algorithms for diffusion models, DiffVoice achieves state-of-the-art performance in text-based speech editing and zero-shot adaptation.
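To make "phoneme-rate" and "jointly model the duration and the latent representation" concrete, here is a minimal sketch. It is not the DiffVoice encoder: the abstract does not say how the autoencoder reaches phoneme rate or how duration is represented. The sketch assumes a known alignment (frames per phoneme), uses plain averaging in place of a learned encoder, and appends log-duration as an extra channel. The function names `pool_to_phoneme_rate` and `joint_targets` are hypothetical.

```python
import math

def pool_to_phoneme_rate(frames, durations):
    """Average frame-level vectors within each phoneme segment.

    frames: list of T feature vectors (each a list of floats).
    durations: list of N positive frame counts that sum to T.
    Returns N vectors, one per phoneme.
    Assumption: mean pooling stands in for a learned encoder.
    """
    if sum(durations) != len(frames):
        raise ValueError("durations must sum to the number of frames")
    pooled, start = [], 0
    for d in durations:
        if d <= 0:
            raise ValueError("each phoneme needs at least one frame")
        segment = frames[start:start + d]
        dim = len(segment[0])
        pooled.append([sum(f[k] for f in segment) / d for k in range(dim)])
        start += d
    return pooled

def joint_targets(pooled, durations):
    """Append log-duration to each phoneme vector, so one generative
    model could be trained on both quantities together.
    Assumption: log-duration is an illustrative choice of encoding."""
    return [vec + [math.log(d)] for vec, d in zip(pooled, durations)]

# Five frames of 2-dimensional features, aligned to two phonemes.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
durations = [2, 3]
pooled = pool_to_phoneme_rate(frames, durations)
targets = joint_targets(pooled, durations)
```

The result has one row per phoneme rather than one per frame, so the sequence a generative model must produce is as long as the phoneme sequence, with duration carried alongside the features.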

