AS, CL, LG, SD | Sep 14, 2019

Bootstrapping non-parallel voice conversion from speaker-adaptive text-to-speech

arXiv:1909.06532v1 | 18 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of data scarcity in voice conversion for speech processing applications, offering a practical solution with incremental improvements.

The paper tackles the problem of building a non-parallel voice conversion system by bootstrapping from a pretrained speaker-adaptive text-to-speech model, enabling competitive performance with small target speaker data and adaptation to unseen languages.

Voice conversion (VC) and text-to-speech (TTS) are two tasks that share a similar objective: generating speech with a target voice. However, they are usually developed independently under vastly different frameworks. In this paper, we propose a methodology to bootstrap a VC system from a pretrained speaker-adaptive TTS model and to unify the techniques as well as the interpretations of these two tasks. Moreover, by offloading the heavy data demand to the training stage of the TTS model, our VC system can be built using a small amount of target speaker speech data. It also opens up the possibility of using speech in an unseen foreign language to build the system. Our subjective evaluations show that the proposed framework not only achieves competitive performance in the standard intra-language scenario but can also adapt and convert using speech utterances in an unseen language.
