Zero-Shot Voice Conditioning for Denoising Diffusion TTS Models
The work enables zero-shot voice conditioning for text-to-speech systems, allowing personalized speech generation from short audio samples. The contribution is incremental in that it builds on existing diffusion models.
The paper tackles the problem of generating speech in the voice of an unseen speaker using a pretrained denoising diffusion model, achieving voice similarity with accuracy comparable to state-of-the-art methods without requiring training.
We present a novel way of conditioning a pretrained denoising diffusion speech model to produce speech in the voice of a novel person unseen during training. The method requires a short (~3 seconds) sample from the target person, and generation is steered at inference time, without any training steps. At the heart of the method lies a sampling process that combines the estimation of the denoising model with a low-pass version of the new speaker's sample. The objective and subjective evaluations show that our sampling method can generate a voice similar to that of the target speaker in terms of frequency, with an accuracy comparable to state-of-the-art methods, and without training.
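The abstract does not spell out the exact sampling rule, so the following is only a minimal sketch of one plausible reading: after each denoising step, the low-frequency content of the model's estimate is swapped for the low-frequency content of a noised copy of the reference sample. This particular combination rule, the FFT-based filter, the noise schedule, and every name here (`low_pass`, `conditioned_step`, `toy_denoiser`, `keep_fraction`) are assumptions made for illustration, and the one-dimensional array stands in for whatever representation the actual model operates on.

```python
import numpy as np

def low_pass(x, keep_fraction=0.25):
    """Keep only the lowest `keep_fraction` of rFFT bins along the last axis."""
    spec = np.fft.rfft(x, axis=-1)
    cutoff = max(1, int(spec.shape[-1] * keep_fraction))
    spec[..., cutoff:] = 0
    return np.fft.irfft(spec, n=x.shape[-1], axis=-1)

def conditioned_step(x_denoised, ref_noised, keep_fraction=0.25):
    """Assumed combination rule: keep the model's high frequencies,
    take the low frequencies from the (noised) reference sample."""
    return (x_denoised
            - low_pass(x_denoised, keep_fraction)
            + low_pass(ref_noised, keep_fraction))

def toy_denoiser(x, t):
    """Hypothetical stand-in for one reverse step of a pretrained model."""
    return 0.9 * x

def sample(reference, steps=10, keep_fraction=0.25, seed=0):
    """Run the toy reverse process, steering every step toward the reference."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(reference.shape)
    for t in range(steps, 0, -1):
        x = toy_denoiser(x, t)
        # Assumed linear schedule: the reference is noised to match step t-1,
        # with no noise left at the final step.
        noise_scale = (t - 1) / steps
        ref_noised = reference + noise_scale * rng.standard_normal(reference.shape)
        x = conditioned_step(x, ref_noised, keep_fraction)
    return x

# Usage: a short synthetic "reference" signal in place of real speech.
reference = np.sin(2 * np.pi * 3 * np.arange(64) / 64)
generated = sample(reference)
```

Because the filter is a projection onto the retained frequency bins, the low-pass content of `generated` matches that of `reference` in this sketch, while everything above the cutoff comes from the denoiser. No model weights are updated at any point, which is the sense in which conditioning happens purely at inference time.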