CL · SD · AS · Apr 12, 2019

Building a mixed-lingual neural TTS system with only monolingual data

arXiv:1904.06063v2 · 30 citations
Originality: Synthesis-oriented
AI Analysis

The paper addresses a practical problem in deploying Chinese TTS systems for mixed-language content, but the approach appears incremental.

The paper tackles the challenge of synthesizing Chinese utterances with embedded English phrases using only monolingual data from a target speaker, focusing on speaker consistency and naturalness, but does not report concrete numerical results.

When deploying a Chinese neural text-to-speech (TTS) synthesis system, one of the challenges is to synthesize Chinese utterances with embedded English phrases or words. This paper looks into the problem in the encoder-decoder framework when only monolingual data from a target speaker is available. Specifically, we view the problem from two aspects: speaker consistency within an utterance and naturalness. We start the investigation with an Average Voice Model built from multi-speaker monolingual data, i.e., Mandarin and English data. On that basis, we look into speaker embedding for speaker consistency within an utterance and phoneme embedding for naturalness and intelligibility, and we study the choice of data for model training. We report the findings and discuss the challenges of building a mixed-lingual TTS system with only monolingual data.
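To make the two ingredients named in the abstract more concrete, here is a minimal sketch of how phoneme embeddings and a speaker embedding might be combined at the input of an encoder-decoder model. Everything in it is an illustrative assumption rather than the paper's method: the toy phoneme inventories, the language-tagged symbols (`zh:`, `en:`), the embedding sizes, the random lookup table standing in for learned weights, and the choice to concatenate the speaker vector onto every phoneme vector.

```python
import random

# Hypothetical toy phoneme inventories; the paper's actual symbol sets are not given here.
MANDARIN_PHONEMES = ["n", "i", "h", "ao"]
ENGLISH_PHONEMES = ["HH", "AH", "L", "OW"]

EMBED_DIM = 4  # assumed size of each phoneme embedding


def build_symbol_table(mandarin, english):
    """Map language-tagged phonemes to integer ids in one shared table."""
    symbols = [f"zh:{p}" for p in mandarin] + [f"en:{p}" for p in english]
    return {sym: idx for idx, sym in enumerate(symbols)}


def init_embeddings(num_rows, dim, seed=0):
    """Random lookup table standing in for learned embeddings."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_rows)]


def encode_utterance(tagged_phonemes, table, phoneme_emb, speaker_vec):
    """Look up each phoneme and append the same speaker vector to every step.

    Concatenation is an assumption made for this sketch.
    """
    return [phoneme_emb[table[sym]] + speaker_vec for sym in tagged_phonemes]


table = build_symbol_table(MANDARIN_PHONEMES, ENGLISH_PHONEMES)
phoneme_emb = init_embeddings(len(table), EMBED_DIM)
speaker_vec = [0.5, -0.5]  # one fixed (made-up) vector for the target speaker

# A mixed-lingual input: Mandarin "ni hao" followed by English "hello".
mixed = ["zh:n", "zh:i", "zh:h", "zh:ao", "en:HH", "en:AH", "en:L", "en:OW"]
encoder_inputs = encode_utterance(mixed, table, phoneme_emb, speaker_vec)
```

Because the same speaker vector is attached to every step regardless of language, the Mandarin and English portions of the utterance are conditioned on one speaker identity, which is the intuition behind using a speaker embedding for within-utterance consistency.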
