Neural voice cloning with a few low-quality samples
This work addresses voice cloning for applications needing personalized speech with minimal data, but it appears incremental as it builds on existing mimicking approaches.
The paper tackles speech synthesis from a limited number of low-quality samples by extracting only a speaker embedding rather than training the entire text-to-speech system on those samples. It evaluates adaptation-based and speaker-encoder-based approaches on the LibriTTS and VCTK datasets to assess how speaker variety affects clarity and target-speaker similarity.
In this paper, we explore the possibility of speech synthesis from low-quality found data using only a limited number of samples from the target speaker. Unlike previous works, which try to train the entire text-to-speech system on found data, we extract only the speaker embedding from the target speaker's found data. In addition, two speaker-mimicking approaches, adaptation-based and speaker-encoder-based, are applied to the newly released LibriTTS dataset and the previously released VCTK corpus to examine the impact of speaker variety on clarity and target-speaker similarity.
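As a rough illustration of the speaker-embedding idea, the sketch below pools a few utterance-level embeddings into one speaker embedding and compares two embeddings by cosine similarity. The abstract does not specify the speaker encoder, the pooling rule, or the similarity measure, so all three are assumptions here: the toy vectors stand in for encoder outputs, and the function names `l2_normalize`, `speaker_embedding`, and `cosine_similarity` are hypothetical.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def speaker_embedding(utterance_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Pool utterance-level embeddings into a single speaker embedding.

    Assumption: mean pooling of unit-length vectors, followed by
    re-normalization. The paper may use a different rule.
    """
    if not utterance_embeddings:
        raise ValueError("need at least one utterance embedding")
    normalized = [l2_normalize(e) for e in utterance_embeddings]
    dim = len(normalized[0])
    if any(len(e) != dim for e in normalized):
        raise ValueError("all embeddings must have the same dimension")
    mean = [sum(e[i] for e in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embeddings (assumed similarity proxy)."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


if __name__ == "__main__":
    # Toy stand-ins for encoder outputs from three noisy samples of one speaker.
    target_samples = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [1.0, 0.0, 0.1]]
    target = speaker_embedding(target_samples)
    # Toy stand-in for the embedding of a synthesized utterance.
    synthesized = [0.85, 0.15, 0.05]
    print(f"similarity to target: {cosine_similarity(target, synthesized):.3f}")
```

Pooling several short samples is one simple way to reduce the influence of noise in any single low-quality recording; whether it helps in practice depends on the encoder and the data.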