SD · CL · AS · Oct 29, 2018

Speaking style adaptation in Text-To-Speech synthesis using Sequence-to-sequence models with attention

arXiv:1810.12051v1 · 1 citation
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of limited training data for specific speaking styles in TTS, offering an incremental improvement for applications that require robust speech synthesis in noisy environments.

The study addresses the problem of generating high-quality Lombard speech in text-to-speech synthesis. It proposes a transfer learning method that adapts a sequence-to-sequence model from normal to Lombard speaking style. In subjective evaluations, the adapted system with a WaveNet vocoder clearly outperformed a conventional deep neural network based TTS system.

There is currently increasing interest in using sequence-to-sequence models with attention for text-to-speech (TTS) synthesis. These models are end-to-end, meaning that they learn both co-articulation and duration properties directly from text and speech. Since these models are entirely data-driven, they need large amounts of data to generate synthetic speech of good quality. However, for challenging speaking styles, such as Lombard speech, it is difficult to record sufficiently large speech corpora. Therefore, in this study we propose a transfer learning method to adapt a sequence-to-sequence based TTS system from normal speaking style to Lombard style. Moreover, we experiment with a WaveNet vocoder in the synthesis of Lombard speech. We conducted subjective evaluations to assess the performance of the adapted TTS systems. The results indicated that an adaptation system with the WaveNet vocoder clearly outperformed the conventional deep neural network based TTS system in the synthesis of Lombard speech.
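
The adaptation idea in the abstract (train on plentiful normal-style data, then continue training on scarce Lombard-style data) can be illustrated with a toy sketch. This is not the paper's model, data, or training recipe: the 1-D linear model, the two synthetic "corpora", the learning rates, and the step counts are all assumptions chosen only to show how a pretrained starting point helps when target data and training budget are small.

```python
# Toy illustration of transfer learning by fine-tuning.
# Everything here is hypothetical: a 1-D linear model stands in for the
# sequence-to-sequence TTS model, and synthetic points stand in for speech data.

def mse(w, b, data):
    """Mean squared error of y = w*x + b on a list of (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train(w, b, data, lr, steps):
    """Full-batch gradient descent starting from the given (w, b)."""
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Large "normal style" corpus (assumed mapping: y = 2.0*x + 1.0).
normal = [(i / 100, 2.0 * (i / 100) + 1.0) for i in range(200)]
# Small "Lombard style" corpus (assumed related mapping: y = 2.2*x + 1.5).
lombard = [(x, 2.2 * x + 1.5) for x in (0.1, 0.7, 1.3, 1.9)]

# 1) Pretrain on the large source corpus.
w_pre, b_pre = train(0.0, 0.0, normal, lr=0.1, steps=2000)

# 2) Adapt: continue from the pretrained weights on the small target corpus.
w_ada, b_ada = train(w_pre, b_pre, lombard, lr=0.05, steps=5)

# 3) Baseline: same small budget on the target corpus, but from scratch.
w_scr, b_scr = train(0.0, 0.0, lombard, lr=0.05, steps=5)

loss_pretrained = mse(w_pre, b_pre, lombard)  # source model, no adaptation
loss_adapted = mse(w_ada, b_ada, lombard)     # after fine-tuning
loss_scratch = mse(w_scr, b_scr, lombard)     # no pretraining

print(f"pretrained only: {loss_pretrained:.4f}")
print(f"adapted:         {loss_adapted:.4f}")
print(f"from scratch:    {loss_scratch:.4f}")
```

In this toy setting the adapted model reaches a lower target-style error than both the unadapted source model and the model trained from scratch with the same few updates. The gap comes from the initialisation, since the model starts close to the target mapping; with a longer training budget, a model this simple would close the gap from scratch as well.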

Code Implementations: 1 repo
