AS SD Dec 28, 2020

Building Multi lingual TTS using Cross Lingual Voice Conversion

arXiv:2012.14039v1 2.31 citations

Originality: Incremental advance

AI Analysis

This work addresses how researchers and developers can build multilingual TTS systems without extensive multilingual speech corpora.

This paper proposes a cross-lingual voice conversion (VC) method in which a single DNN model generates the speech parameters from phonetic posteriorgrams (PPGs) extracted by multiple ASR acoustic models. The authors applied this method to build a multilingual TTS system without a multilingual corpus, testing three approaches. The best of these, Approach 1, achieved a naturalness MOS of 3.28 on a 5-point scale and a similarity MOS of 2.77.

In this paper, we propose a new cross-lingual Voice Conversion (VC) approach that can generate all speech parameters (MCEP, LF0, BAP) from one DNN model using PPGs (Phonetic PosteriorGrams) extracted from input speech by several ASR acoustic models. Using the proposed VC method, we tried three different approaches to building a multilingual TTS system without recording a multilingual speech corpus. A listening test was carried out to evaluate both speech quality (naturalness) and voice similarity between converted speech and target speech. The results show that Approach 1 achieved the highest naturalness (3.28 MOS on a 5-point scale) and similarity (2.77 MOS).
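
The data flow described in the abstract can be illustrated with a small sketch: per-frame PPGs from several ASR acoustic models go into one network that emits MCEP, LF0 and BAP together. Everything concrete here is an assumption made for illustration and is not taken from the paper: the feature sizes, the frame-wise concatenation of the PPGs, the plain feed-forward layers, and the helper names (`fake_ppg`, `init_layer`, `dense`, `convert_frame`). The PPGs are random stand-ins and the weights are untrained, so the sketch shows only the shapes involved, not usable speech parameters.

```python
import math
import random

random.seed(0)

# All sizes are illustrative assumptions, not values from the paper.
N_FRAMES = 50
PPG_DIMS = [40, 48]  # hypothetical phone-class counts, one per ASR model
MCEP_DIM, LF0_DIM, BAP_DIM = 40, 1, 5
HIDDEN = 32


def fake_ppg(n_frames, n_classes):
    """Stand-in for an ASR acoustic model: one posterior distribution per frame."""
    frames = []
    for _ in range(n_frames):
        exps = [math.exp(random.gauss(0.0, 1.0)) for _ in range(n_classes)]
        total = sum(exps)
        frames.append([e / total for e in exps])
    return frames


def init_layer(n_in, n_out):
    """Random, untrained weights (n_out rows of n_in values) and zero biases."""
    weights = [[random.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer, activation=None):
    """Apply one fully connected layer to a single input vector."""
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [activation(o) for o in out] if activation else out


def convert_frame(ppg_frames, layers):
    """Concatenate one frame's PPG vectors and predict MCEP, LF0 and BAP together."""
    x = [p for frame in ppg_frames for p in frame]
    for layer in layers[:-1]:
        x = dense(x, layer, math.tanh)
    y = dense(x, layers[-1])
    return y[:MCEP_DIM], y[MCEP_DIM:MCEP_DIM + LF0_DIM], y[MCEP_DIM + LF0_DIM:]


# One PPG sequence per (hypothetical) ASR acoustic model.
ppgs = [fake_ppg(N_FRAMES, d) for d in PPG_DIMS]
out_dim = MCEP_DIM + LF0_DIM + BAP_DIM
layers = [
    init_layer(sum(PPG_DIMS), HIDDEN),
    init_layer(HIDDEN, HIDDEN),
    init_layer(HIDDEN, out_dim),
]

# zip(*ppgs) pairs up the frames that share a time index across models.
outputs = [convert_frame(frames, layers) for frames in zip(*ppgs)]
mcep, lf0, bap = outputs[0]
print(len(outputs), len(mcep), len(lf0), len(bap))
```

Running the script prints the number of converted frames followed by the MCEP, LF0 and BAP vector lengths for the first frame. In a real system, the three parameter streams would be passed to a vocoder to synthesize the waveform.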
