AS · SD · Nov 24, 2021

One-shot Voice Conversion For Style Transfer Based On Speaker Adaptation

arXiv:2111.12277v2 · 2 citations
Originality: Incremental advance
AI Analysis

This work addresses the problem of generating expressive and speaker-similar voice conversions with minimal data for applications in speech synthesis and audio editing, representing an incremental improvement over existing methods.

The paper tackles one-shot voice conversion for style transfer, where training on a single utterance leads to over-fitting, poor speaker similarity, and limited expressiveness. The proposed approach is based on speaker adaptation and combines a speaker normalization module, weight regularization, and a prosody module. It achieves better style and speaker similarity than state-of-the-art one-shot VC systems while maintaining good speech quality.

One-shot style transfer is a challenging task, since training on one utterance makes the model extremely prone to over-fitting the training data, which causes low speaker similarity and a lack of expressiveness. In this paper, we build on the recognition-synthesis framework and propose a one-shot voice conversion approach for style transfer based on speaker adaptation. First, a speaker normalization module is adopted to remove speaker-related information from the bottleneck features extracted by ASR. Second, we adopt weight regularization in the adaptation process to prevent the over-fitting caused by using only one utterance from the target speaker as training data. Finally, to comprehensively decouple the speech factors (content, speaker, and style) and transfer the source style to the target, a prosody module is used to extract a prosody representation. Experiments show that our approach is superior to state-of-the-art one-shot VC systems in terms of style and speaker similarity; additionally, it maintains good speech quality.
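
The abstract does not say which form of weight regularization is used during adaptation. As a minimal sketch only, assuming an L2 penalty that pulls the adapted weights back toward their pretrained values (a common choice in fine-tuning, not confirmed as the authors' method), the idea can be illustrated as follows. The names `adaptation_loss`, `regularized_step`, `lam`, and `lr` are hypothetical and chosen for illustration.

```python
from typing import List


def adaptation_loss(task_loss: float,
                    weights: List[float],
                    pretrained: List[float],
                    lam: float = 0.1) -> float:
    """Task loss plus an L2 penalty on drift from the pretrained weights.

    ASSUMPTION: the penalty form lam * sum((w - p)^2) is illustrative;
    the abstract only states that weight regularization is used.
    """
    drift = sum((w - p) ** 2 for w, p in zip(weights, pretrained))
    return task_loss + lam * drift


def regularized_step(weights: List[float],
                     pretrained: List[float],
                     task_grads: List[float],
                     lr: float = 0.01,
                     lam: float = 0.1) -> List[float]:
    """One gradient-descent step on the regularized adaptation loss.

    The gradient of lam * (w - p)^2 with respect to w is 2 * lam * (w - p),
    so weights that drift far from the pretrained values are pulled back.
    """
    return [
        w - lr * (g + 2.0 * lam * (w - p))
        for w, p, g in zip(weights, pretrained, task_grads)
    ]


if __name__ == "__main__":
    pretrained = [1.0, 2.0]
    adapted = [2.0, 2.0]  # hypothetical weights after fitting one utterance
    print(adaptation_loss(1.0, adapted, pretrained, lam=0.5))
    print(regularized_step(adapted, pretrained, [0.0, 0.0], lr=0.5, lam=0.5))
```

With a zero task gradient, the update moves each weight only toward its pretrained value, which is the mechanism that limits over-fitting when a single target utterance is the only training data.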

