CL, SD, AS · Aug 17, 2021

Combining speakers of multiple languages to improve quality of neural voices

arXiv:2108.07737v1 · 9 citations
Originality: Incremental advance
AI Analysis

This work addresses data efficiency and cross-lingual synthesis in neural TTS systems. It is an incremental advance because it builds on existing multi-speaker and multi-lingual approaches.

The paper tackles two problems: improving neural TTS quality when target-language data is limited, and enabling cross-lingual synthesis. It does so by developing a multi-speaker, multi-lingual system. When fine-tuned to a speaker, the system produces significantly better quality than a single-speaker model in most cases, while using less than 40% of the speaker data that the single-speaker model was built on. In cross-lingual synthesis, the average quality is within 80% of native single-speaker models in terms of Mean Opinion Score.

In this work, we explore multiple architectures and training procedures for developing a multi-speaker and multi-lingual neural TTS system with the goals of a) improving the quality when the available data in the target language is limited and b) enabling cross-lingual synthesis. We report results from a large experiment using 30 speakers in 8 different languages across 15 different locales. The system is trained on the same amount of data per speaker. Compared to a single-speaker model, when the suggested system is fine-tuned to a speaker, it produces significantly better quality in most cases while using less than $40\%$ of the speaker's data used to build the single-speaker model. In cross-lingual synthesis, on average, the generated quality is within $80\%$ of native single-speaker models, in terms of Mean Opinion Score.

