AS, SD · Jun 12, 2020

Neural voice cloning with a few low-quality samples

arXiv:2006.06940v1 · 3 citations
Originality: Incremental advance
AI Analysis

This work addresses voice cloning for applications needing personalized speech with minimal data, but it appears incremental as it builds on existing mimicking approaches.

The paper tackles speech synthesis from a small number of low-quality samples by extracting only a speaker embedding from them, rather than training the entire text-to-speech system on that data. It evaluates adaptation-based and speaker-encoder-based approaches on the LibriTTS and VCTK datasets to assess how speaker variety affects clarity and target-speaker similarity.

In this paper, we explore the possibility of speech synthesis from low-quality found data using only a limited number of samples from the target speaker. Unlike previous work, which tries to train the entire text-to-speech system on found data, we extract only the speaker embedding from the target speaker's found data. In addition, two speaker-mimicking approaches, adaptation-based and speaker-encoder-based, are applied to the newly released LibriTTS dataset and the previously released VCTK corpus to examine the impact of speaker variety on clarity and target-speaker similarity.

