AS LG SD | Mar 6, 2021

Investigating on Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech

Chung-Ming Chien, Jheng-Hao Lin, Chien-yu Huang, Po-chun Hsu, Hung-yi Lee

arXiv:2103.04088v516.779 citations | Has Code

Originality: Incremental advance

AI Analysis

This work addresses voice cloning for multiple speakers and styles with limited data. The advance is incremental because it builds on existing methods such as FastSpeech 2.

The paper tackles the few-shot multi-speaker multi-style voice cloning problem by integrating pretrained and learnable speaker representations, and it achieved second place in the one-shot track of the ICASSP 2021 M2VoC challenge.

The few-shot multi-speaker multi-style voice cloning task is to synthesize utterances with voice and speaking style similar to a reference speaker given only a few reference samples. In this work, we investigate different speaker representations and propose to integrate pretrained and learnable speaker representations. Among the different types of embeddings, the embedding pretrained by voice conversion achieves the best performance. The FastSpeech 2 model combined with both pretrained and learnable speaker representations shows great generalization ability on few-shot speakers and achieved 2nd place in the one-shot track of the ICASSP 2021 M2VoC challenge.
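
The abstract does not say how the two kinds of speaker representation are combined, so the following is only a minimal sketch under stated assumptions: the pretrained embedding is linearly projected to the model dimension, summed with a learnable per-speaker embedding, and the result is added to every encoder frame. The function names, the projection matrix, the summation rule, and all numbers are hypothetical and are not taken from the paper or its code.

```python
# Illustrative sketch only: the combination rule (projection + sum) is an
# assumption, not the method confirmed by the abstract.

def combine_speaker_representations(pretrained_vec, projection, learnable_table, speaker_id):
    """Project a fixed pretrained embedding and add a learnable per-speaker one."""
    # Linear projection: one output value per row of the projection matrix.
    projected = [sum(w * x for w, x in zip(row, pretrained_vec)) for row in projection]
    # Lookup of the trainable embedding for this speaker (hypothetical table).
    learned = learnable_table[speaker_id]
    return [p + l for p, l in zip(projected, learned)]


def condition_encoder_output(encoder_frames, speaker_vec):
    """Add the same speaker vector to every frame of the encoder output."""
    return [[h + s for h, s in zip(frame, speaker_vec)] for frame in encoder_frames]


# Toy values: a 3-dim "pretrained" embedding and a 2-dim model space.
pretrained_vec = [1.0, 2.0, 3.0]
projection = [[1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0]]
learnable_table = {"spk_a": [0.5, -0.5]}

speaker_vec = combine_speaker_representations(
    pretrained_vec, projection, learnable_table, "spk_a"
)
conditioned = condition_encoder_output([[0.0, 0.0], [1.0, 1.0]], speaker_vec)
```

In a trained system the projection and the lookup table would be optimized jointly with the synthesis model, while the pretrained embedding would come from a separately trained network such as a voice conversion model.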
