SD CL AS · Oct 29, 2018

Speaking style adaptation in Text-To-Speech synthesis using Sequence-to-sequence models with attention

Bajibabu Bollepalli, Lauri Juvela, Paavo Alku

arXiv:1810.12051v12.91 citations

Originality: Incremental advance

AI Analysis

This work addresses the shortage of training data for specific speaking styles in TTS. It offers an incremental improvement for applications that require robust speech synthesis in noisy environments.

The study tackles the problem of generating high-quality Lombard speech in text-to-speech synthesis. It proposes a transfer learning method that adapts a sequence-to-sequence model from normal to Lombard speaking style, and reports that in subjective evaluations the adapted system with a WaveNet vocoder clearly outperformed a conventional deep neural network based TTS system.

Currently, there is increasing interest in using sequence-to-sequence models with attention for text-to-speech (TTS) synthesis. These models are end-to-end, meaning that they learn both co-articulation and duration properties directly from text and speech. Since these models are entirely data-driven, they need large amounts of data to generate good-quality synthetic speech. However, for challenging speaking styles such as Lombard speech, it is difficult to record sufficiently large speech corpora. Therefore, in this study we propose a transfer learning method to adapt a sequence-to-sequence based TTS system from normal speaking style to Lombard style. Moreover, we experiment with a WaveNet vocoder in the synthesis of Lombard speech. We conducted subjective evaluations to assess the performance of the adapted TTS systems. The results indicated that an adaptation system with the WaveNet vocoder clearly outperformed the conventional deep neural network based TTS system in the synthesis of Lombard speech.
