SD AI LG AS SP · Jun 5, 2022

Zero-Shot Voice Conditioning for Denoising Diffusion TTS Models

Meta AI
arXiv:2206.02246v2 · 33 citations · h-index: 38
Originality: Incremental advance
AI Analysis

The method enables zero-shot voice conditioning for text-to-speech systems, allowing personalized speech generation from short audio samples. The advance is incremental because it builds on existing diffusion models.

The paper addresses generating speech in the voice of an unseen speaker with a pretrained denoising diffusion model. It reports voice similarity comparable to state-of-the-art methods without any additional training.

We present a novel way of conditioning a pretrained denoising diffusion speech model to produce speech in the voice of a novel person unseen during training. The method requires a short (~3 seconds) sample from the target person, and generation is steered at inference time, without any training steps. At the heart of the method lies a sampling process that combines the estimation of the denoising model with a low-pass version of the new speaker's sample. The objective and subjective evaluations show that our sampling method can generate a voice similar to that of the target speaker in terms of frequency, with an accuracy comparable to state-of-the-art methods, and without training.
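
To make the sampling idea concrete, here is a minimal sketch of one way a model estimate could be combined with a low-pass version of a reference sample at each reverse step. It is an illustration under stated assumptions, not the paper's implementation: the FFT-based filter, the `keep_fraction` parameter, the linear noise schedule, the `denoise_step` callable, and the use of a 1-D array are all hypothetical choices made for this example. A real system would operate on whatever representation the pretrained model uses.

```python
import numpy as np

def low_pass(x, keep_fraction=0.1):
    """Crude low-pass filter: zero all but the lowest rFFT bins.
    (Assumed filter for illustration; the abstract does not specify one.)"""
    spectrum = np.fft.rfft(x)
    cutoff = max(1, int(len(spectrum) * keep_fraction))
    spectrum[cutoff:] = 0
    return np.fft.irfft(spectrum, n=len(x))

def combine(model_estimate, reference, keep_fraction=0.1):
    """Keep the high band of the model estimate and take the low band
    from the reference."""
    return (model_estimate
            - low_pass(model_estimate, keep_fraction)
            + low_pass(reference, keep_fraction))

def sample(denoise_step, reference, num_steps=50, keep_fraction=0.1, seed=0):
    """Toy reverse-diffusion loop steered at inference time, with no training.
    `denoise_step(x, t)` is a hypothetical stand-in for one reverse step of
    a pretrained model."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(len(reference))
    for t in range(num_steps, 0, -1):
        x = denoise_step(x, t)
        # Assumed toy schedule: the reference is noised less as t shrinks,
        # and is used without noise on the final step.
        noise_scale = (t - 1) / num_steps
        noisy_ref = reference + noise_scale * rng.standard_normal(len(reference))
        x = combine(x, noisy_ref, keep_fraction)
    return x
```

With this construction the low band of the output matches the low band of the reference after the final step, while the content above the cutoff is left to the model. The choice of cutoff controls how much of the reference is imposed on the result.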

