SD · AI · AS · Jan 20, 2022

Cross-Lingual Text-to-Speech Using Multi-Task Learning and Speaker Classifier Joint Training

arXiv:2201.08124v1 · 12 citations
AI Analysis

This paper addresses the challenge of preserving speaker identity when synthesizing speech in multiple languages for monoglot speakers. It is an incremental improvement in cross-lingual TTS.

The paper tackles low speaker similarity in cross-lingual text-to-speech synthesis by proposing a multi-task learning framework and joint training with a speaker classifier. In both subjective and objective evaluations, this yields consistent improvements in speaker similarity for seen and unseen speakers.

In cross-lingual speech synthesis, speech in various languages can be synthesized for a monoglot speaker. Normally, only data from monoglot speakers are available for model training, so the speaker similarity between the synthesized cross-lingual speech and the native-language recordings is relatively low. Based on the multilingual transformer text-to-speech model, this paper studies a multi-task learning framework to improve cross-lingual speaker similarity. To improve the speaker similarity further, joint training with a speaker classifier is proposed. A scheme similar to parallel scheduled sampling is proposed so that the transformer model can still be trained efficiently, without breaking the parallel training mechanism when joint training is introduced. With multi-task learning and speaker classifier joint training, the cross-lingual speaker similarity is consistently improved in subjective and objective evaluations, for both speakers seen in the training set and unseen speakers.
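To make the two training ideas in the abstract concrete, here is a minimal Python sketch. It is not the paper's implementation: the function names, the loss weight, and the per-frame mixing probability are all assumptions chosen for illustration, and the summary above does not specify the actual losses or architecture. The first function shows the general idea behind parallel scheduled sampling (mixing ground-truth decoder inputs with predictions from a first teacher-forced pass, so the second pass can still run in parallel). The other two show one common way to add a speaker-classification term to a TTS loss.

```python
import math
import random


def mix_decoder_inputs(ground_truth, first_pass_pred, mix_prob, rng):
    """Build second-pass decoder inputs (hypothetical helper).

    For each time step, use the first-pass prediction with probability
    mix_prob, otherwise keep the ground-truth frame. Both passes can be
    computed for all time steps at once, so parallel training is kept.
    """
    return [
        pred if rng.random() < mix_prob else gt
        for gt, pred in zip(ground_truth, first_pass_pred)
    ]


def speaker_cross_entropy(logits, speaker_id):
    """Softmax cross-entropy of speaker-classifier logits for one utterance."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[speaker_id]


def joint_loss(tts_loss, logits, speaker_id, weight=0.1):
    """TTS loss plus a weighted speaker-classification loss.

    The weight of 0.1 is an arbitrary illustrative value, not a setting
    taken from the paper.
    """
    return tts_loss + weight * speaker_cross_entropy(logits, speaker_id)


if __name__ == "__main__":
    rng = random.Random(0)
    gt_frames = [0.0, 0.1, 0.2, 0.3]
    pred_frames = [0.05, 0.12, 0.18, 0.33]
    mixed = mix_decoder_inputs(gt_frames, pred_frames, mix_prob=0.5, rng=rng)
    print("mixed decoder inputs:", mixed)
    print("joint loss:", joint_loss(1.0, [2.0, 0.5, -1.0], speaker_id=0))
```

In a real system the frames would be mel-spectrogram vectors and the speaker logits would come from a classifier applied to the synthesized output, so that gradients from the speaker loss flow back into the TTS model.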

