SD · CL · AS · Jun 24, 2022

Exact Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech

arXiv:2206.12229v2 · 25 citations · h-index: 38
Originality: Incremental advance
AI Analysis

This work addresses the need for precise voice and prosody cloning in TTS applications, representing an incremental advancement by combining existing tasks with new normalization and alignment techniques.

The paper tackles the problem of cloning both a speaker's voice and the prosody of a reference audio in zero-shot multispeaker text-to-speech, achieving high similarity and no degradation in quality as shown by objective evaluation and human studies.

The cloning of a speaker's voice using an untranscribed reference sample is one of the great advances of modern neural text-to-speech (TTS) methods. Approaches for mimicking the prosody of a transcribed reference audio have also been proposed recently. In this work, we bring these two tasks together for the first time through utterance-level normalization in conjunction with an utterance-level speaker embedding. We further introduce a lightweight aligner, used to extract fine-grained prosodic features, that can be fine-tuned on individual samples within seconds. We show that it is possible to clone the voice of a speaker as well as the prosody of a spoken reference independently, without any degradation in quality and with high similarity to both the original voice and prosody, as confirmed by our objective evaluation and human study. All of our code and trained models are available, alongside static and interactive demos.
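To make the idea of utterance-level normalization concrete, here is a minimal sketch. It assumes that normalization means dividing each prosodic value (for example per-phone pitch) by the mean of the non-zero values in the same utterance; the abstract does not spell out the exact formula, so this is one plausible reading and not the authors' confirmed implementation. The function name and the sample values are hypothetical.

```python
from statistics import fmean


def normalize_by_utterance_mean(values):
    """Divide each non-zero value by the mean of the non-zero values.

    Assumption: zeros mark unvoiced or silent units and are left at zero.
    """
    voiced = [v for v in values if v != 0.0]
    if not voiced:
        # Nothing to normalize against; return an unchanged copy.
        return list(values)
    mean = fmean(voiced)
    return [v / mean if v != 0.0 else 0.0 for v in values]


# Hypothetical per-phone pitch values (Hz): the same contour spoken
# by a low-pitched and a high-pitched speaker.
low_voice = [100.0, 120.0, 0.0, 80.0]
high_voice = [200.0, 240.0, 0.0, 160.0]

norm_low = normalize_by_utterance_mean(low_voice)
norm_high = normalize_by_utterance_mean(high_voice)
```

Under this assumption, both contours reduce to the same relative shape, so the speaker's absolute pitch range is left for the speaker embedding to supply while the normalized values carry only the prosody.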

Code Implementations: 2 repos
