SD AI AS | Jan 20, 2022

Cross-Lingual Text-to-Speech Using Multi-Task Learning and Speaker Classifier Joint Training

arXiv:2201.08124v1 | 12 citations

AI Analysis

This paper addresses the challenge of preserving speaker identity when synthesizing speech in multiple languages for monoglot speakers. The contribution is an incremental improvement in cross-lingual TTS.

The paper tackles low speaker similarity in cross-lingual text-to-speech synthesis by proposing a multi-task learning framework and joint training with a speaker classifier. It reports consistent improvements in speaker similarity for both seen and unseen speakers in subjective and objective evaluations.

In cross-lingual speech synthesis, the speech in various languages can be synthesized for a monoglot speaker. Normally, only the data of monoglot speakers are available for model training, thus the speaker similarity is relatively low between the synthesized cross-lingual speech and the native language recordings. Based on the multilingual transformer text-to-speech model, this paper studies a multi-task learning framework to improve the cross-lingual speaker similarity. To further improve the speaker similarity, joint training with a speaker classifier is proposed. Here, a scheme similar to parallel scheduled sampling is proposed to train the transformer model efficiently to avoid breaking the parallel training mechanism when introducing joint training. By using multi-task learning and speaker classifier joint training, in subjective and objective evaluations, the cross-lingual speaker similarity can be consistently improved for both the seen and unseen speakers in the training set.
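The two training ideas named in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the per-frame mixing rule, and the loss weight are all assumptions made for illustration. It assumes the general parallel scheduled sampling recipe, in which a first teacher-forced pass produces predictions for every position at once, and the decoder inputs for a second pass are built by mixing gold and predicted frames position by position. It further assumes that joint training adds the speaker classifier's cross-entropy to the synthesis loss.

```python
import math
import random


def mix_inputs(gold, predicted, sample_prob, rng):
    """Build second-pass decoder inputs (hypothetical helper).

    For each position, use the first-pass prediction with probability
    sample_prob and the gold frame otherwise. Every position is decided
    independently, so the second pass can still run in parallel.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return [p if rng.random() < sample_prob else g
            for g, p in zip(gold, predicted)]


def cross_entropy(logits, target):
    """Softmax cross-entropy for one example, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def joint_loss(tts_loss, speaker_logits, speaker_id, weight):
    """Synthesis loss plus a weighted speaker-classification loss.

    The weight is an assumed hyperparameter, not a value from the paper.
    """
    return tts_loss + weight * cross_entropy(speaker_logits, speaker_id)


if __name__ == "__main__":
    rng = random.Random(0)
    gold = [0.1, 0.2, 0.3, 0.4]       # stand-ins for gold acoustic frames
    predicted = [0.5, 0.6, 0.7, 0.8]  # stand-ins for first-pass outputs
    print(mix_inputs(gold, predicted, 0.5, rng))
    print(joint_loss(1.0, [2.0, 0.5, -1.0], 0, 0.1))
```

With `sample_prob` set to 0 the second pass reduces to ordinary teacher forcing, and with 1 it sees only the model's own predictions. In a real system the frames would be spectrogram vectors and the losses would be computed by a deep learning framework; scalars are used here only to keep the sketch self-contained.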
