SDAS
Sep 23, 2021

Unet-TTS: Improving Unseen Speaker and Style Transfer in One-shot Voice Cloning

arXiv:2109.11115v3
18 citations
Originality: Incremental advance
AI Analysis

The paper addresses accurate voice and style transfer in speech synthesis for applications such as personalized TTS, though the advance appears incremental because it builds on existing U-net structures.

The paper tackled the challenge of one-shot voice cloning for unseen speakers and styles, proposing Unet-TTS, which outperformed existing methods on an unseen emotional corpus in similarity evaluations.

One-shot voice cloning aims to transform speaker voice and speaking style in speech synthesized from a text-to-speech (TTS) system, where only a single reference recording of the target speech is available. Out-of-domain transfer is still a challenging task, and one important aspect that impacts the accuracy and similarity of synthetic speech is the conditional representations carrying speaker or style cues extracted from the limited references. In this paper, we present a novel one-shot voice cloning algorithm called Unet-TTS that has good generalization ability for unseen speakers and styles. Based on a skip-connected U-net structure, the new model can efficiently discover speaker-level and utterance-level spectral feature details from the reference audio, enabling accurate inference of complex acoustic characteristics as well as imitation of speaking styles in the synthetic speech. According to both subjective and objective evaluations of similarity, the new model outperforms both speaker embedding and unsupervised style modeling (GST) approaches on an unseen emotional corpus.
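
To make the phrase "skip-connected U-net structure" concrete, here is a minimal NumPy sketch of the general pattern. It is not the Unet-TTS architecture: the function names (`down`, `up`, `toy_unet`), the random weights, the layer sizes, and the averaging/repeat resampling are all assumptions chosen for illustration. It only shows how features saved at each encoder resolution are concatenated back into the decoder, so that fine spectral detail can bypass the bottleneck.

```python
import numpy as np

def down(x):
    # Halve the frame count by averaging adjacent frames: (T, C) -> (T/2, C).
    return 0.5 * (x[0::2] + x[1::2])

def up(x):
    # Double the frame count by repeating each frame: (T, C) -> (2T, C).
    return np.repeat(x, 2, axis=0)

def toy_unet(mel, weights):
    """Hypothetical U-net-shaped pass over a (T, C) spectrogram-like array.

    weights is a list of (enc_w, dec_w) pairs, one per level, with
    enc_w of shape (C, C) and dec_w of shape (2C, C).
    """
    depth = len(weights)
    if mel.shape[0] % (2 ** depth) != 0:
        raise ValueError("frame count must be divisible by 2**depth")

    skips = []
    h = mel
    # Encoder: transform, remember the features, then downsample.
    for enc_w, _ in weights:
        h = np.tanh(h @ enc_w)
        skips.append(h)
        h = down(h)

    # Decoder: upsample, then fuse with the saved features of the same
    # resolution (the skip connection) before transforming.
    for (_, dec_w), skip in zip(reversed(weights), reversed(skips)):
        h = up(h)
        h = np.tanh(np.concatenate([h, skip], axis=1) @ dec_w)
    return h

rng = np.random.default_rng(0)
T, C, depth = 16, 8, 2  # illustrative sizes only
weights = [
    (0.1 * rng.normal(size=(C, C)), 0.1 * rng.normal(size=(2 * C, C)))
    for _ in range(depth)
]
mel = rng.normal(size=(T, C))  # stand-in for a reference mel-spectrogram
out = toy_unet(mel, weights)
print(out.shape)
```

The output has the same time and channel dimensions as the input, which is the property that lets a U-net emit frame-level features while still drawing on coarse, utterance-level context from the deeper levels.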

Code Implementations: 1 repo