AS | CL | SD | Aug 11, 2020

Modeling Prosodic Phrasing with Multi-Task Learning in Tacotron-based TTS

arXiv:2008.05284v1 | 21 citations
Originality: Incremental advance
AI Analysis

This work addresses prosodic phrasing errors in speech synthesis, evaluated on Chinese and Mongolian, but it is incremental because it builds on the existing Tacotron framework.

The paper tackled prosodic phrasing errors in Tacotron-based text-to-speech synthesis for long sentences by extending the framework with multi-task learning to predict both Mel spectrum and phrase breaks, resulting in consistent voice quality improvements for Chinese and Mongolian systems.

Tacotron-based end-to-end speech synthesis has shown remarkable voice quality. However, the rendering of prosody in the synthesized speech remains to be improved, especially for long sentences, where prosodic phrasing errors can occur frequently. In this paper, we extend the Tacotron-based speech synthesis framework to explicitly model prosodic phrase breaks. We propose a multi-task learning scheme for Tacotron training that optimizes the system to predict both the Mel spectrum and phrase breaks. To the best of our knowledge, this is the first implementation of multi-task learning for Tacotron-based TTS with a prosodic phrasing model. Experiments show that our proposed training scheme consistently improves the voice quality for both Chinese and Mongolian systems.
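To make the multi-task idea concrete, the sketch below combines a Mel-spectrum reconstruction loss with a phrase-break classification loss into one training objective. It is an illustration under assumptions, not the paper's implementation: the abstract does not state the loss functions or their weighting, so the use of mean squared error, binary cross-entropy, the `break_weight` parameter, and all function names here are hypothetical.

```python
import math


def mel_mse(mel_pred, mel_target):
    """Mean squared error over all frames and Mel bins (assumed main-task loss)."""
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(mel_pred, mel_target):
        for p, t in zip(pred_frame, target_frame):
            total += (p - t) ** 2
            count += 1
    return total / count


def break_cross_entropy(break_probs, break_labels, eps=1e-12):
    """Binary cross-entropy over per-token phrase-break predictions
    (assumed auxiliary-task loss; label 1 means a break follows the token)."""
    total = 0.0
    for prob, label in zip(break_probs, break_labels):
        total -= label * math.log(prob + eps) + (1 - label) * math.log(1 - prob + eps)
    return total / len(break_labels)


def multitask_loss(mel_pred, mel_target, break_probs, break_labels, break_weight=0.5):
    """Weighted sum of the two task losses; break_weight is a hypothetical hyperparameter."""
    return mel_mse(mel_pred, mel_target) + break_weight * break_cross_entropy(
        break_probs, break_labels
    )


# Toy example: 2 frames x 2 Mel bins, and 3 input tokens with break labels.
mel_pred = [[1.0, 2.0], [0.0, 0.0]]
mel_target = [[0.0, 0.0], [0.0, 0.0]]
break_probs = [0.9, 0.2, 0.5]
break_labels = [1, 0, 1]
loss = multitask_loss(mel_pred, mel_target, break_probs, break_labels)
```

In a setup like this, both terms are backpropagated through the shared encoder, so the phrase-break supervision can shape the representations that the Mel decoder also relies on. Setting `break_weight` to zero recovers single-task Mel-spectrum training.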

