AS, LG, SD | Mar 6, 2021

Investigating on Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech

arXiv:2103.04088v5
79 citations
Originality: Incremental advance
AI Analysis

This work addresses voice cloning for multiple speakers and styles with limited data, but it is incremental as it builds on existing methods like FastSpeech 2.

The paper tackled the few-shot multi-speaker multi-style voice cloning problem by integrating pretrained and learnable speaker representations, achieving second place in the ICASSP 2021 M2VoC challenge one-shot track.

The few-shot multi-speaker multi-style voice cloning task is to synthesize utterances with voice and speaking style similar to a reference speaker given only a few reference samples. In this work, we investigate different speaker representations and propose integrating pretrained and learnable speaker representations. Among the embedding types compared, the embedding pretrained by voice conversion achieves the best performance. The FastSpeech 2 model combined with both pretrained and learnable speaker representations generalizes well to few-shot speakers and achieved 2nd place in the one-shot track of the ICASSP 2021 M2VoC challenge.
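
The abstract does not say how the two kinds of speaker representation are combined inside FastSpeech 2. As a rough illustration only, the sketch below assumes one common fusion rule: a fixed pretrained embedding is linearly projected to the model's hidden size, summed with a per-speaker learnable embedding, and the result is added to every encoder frame. All names (`condition_on_speaker`, `speaker_table`, `proj`) and all sizes are hypothetical, and the random values stand in for trained weights and real encoder outputs; the paper's actual architecture may differ.

```python
import random


def linear(vec, weight):
    """Project vec (length n) with a weight matrix of shape (m, n)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def condition_on_speaker(encoder_out, pretrained_emb, learnable_emb, proj):
    """Add both speaker vectors to every encoder frame.

    This additive fusion is an assumption for illustration, not the
    method confirmed by the source.
    """
    projected = linear(pretrained_emb, proj)
    speaker = [p + l for p, l in zip(projected, learnable_emb)]
    return [[h + s for h, s in zip(frame, speaker)] for frame in encoder_out]


rng = random.Random(0)
HIDDEN, PRETRAINED_DIM, NUM_SPEAKERS = 4, 6, 3  # hypothetical sizes

# Learnable table: one vector per training speaker (random init here;
# in a real model these would be updated by gradient descent).
speaker_table = [
    [rng.gauss(0, 0.1) for _ in range(HIDDEN)] for _ in range(NUM_SPEAKERS)
]

# Projection from the pretrained embedding space to the hidden size.
proj = [
    [rng.gauss(0, 0.1) for _ in range(PRETRAINED_DIM)] for _ in range(HIDDEN)
]

# Stand-ins: a pretrained embedding extracted from reference audio and
# a 5-frame encoder output (zeros, so the speaker offset is easy to see).
pretrained_emb = [rng.gauss(0, 1) for _ in range(PRETRAINED_DIM)]
encoder_out = [[0.0] * HIDDEN for _ in range(5)]

conditioned = condition_on_speaker(
    encoder_out, pretrained_emb, speaker_table[1], proj
)
```

In this sketch the pretrained embedding can be computed for an unseen speaker from a single reference utterance, while the learnable table only covers speakers seen in training, which is one plausible reason to keep both.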

Code Implementations: 1 repo
