AS · CL · LG · SD · Jan 13, 2021

Whispered and Lombard Neural Speech Synthesis

arXiv:2101.05313v1 · 17 citations
Originality: Incremental advance
AI Analysis

This paper addresses context-aware speech synthesis for users in varied environments. The advance is incremental: it builds on existing methods such as Tacotron.

The paper tackles generating different speaking styles (normal, Lombard, whisper) in text-to-speech with limited data. It shows that pre-training followed by fine-tuning produces high-quality speech, that a speaker verification model can serve as a style encoder for synthesis, and that synthetic Lombard speech significantly improves intelligibility.

It is desirable for a text-to-speech system to take into account the environment where synthetic speech is presented, and provide appropriate context-dependent output to the user. In this paper, we present and compare various approaches for generating different speaking styles, namely, normal, Lombard, and whisper speech, using only limited data. The following systems are proposed and assessed: 1) Pre-training and fine-tuning a model for each style. 2) Lombard and whisper speech conversion through a signal processing based approach. 3) Multi-style generation using a single model based on a speaker verification model. Our mean opinion score and AB preference listening tests show that 1) we can generate high quality speech through the pre-training/fine-tuning approach for all speaking styles. 2) Although our speaker verification (SV) model is not explicitly trained to discriminate different speaking styles, and no Lombard and whisper voice is used for pre-training this system, the SV model can be used as a style encoder for generating different style embeddings as input for the Tacotron system. We also show that the resulting synthetic Lombard speech has a significant positive impact on intelligibility gain.
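The abstract does not say how the style embedding from the SV model is fed to Tacotron. As a rough illustration only, the sketch below assumes one common conditioning scheme: average the per-utterance embeddings of a style, normalize the result, and concatenate it to every encoder time step. The function names, the averaging step, and the toy vectors are all hypothetical and are not taken from the paper.

```python
from math import sqrt


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def style_embedding(utterance_embeddings):
    """Average per-utterance embeddings of one style, then L2-normalize.

    Assumption: each inner list stands in for the output of a speaker
    verification model on one utterance of the target style.
    """
    count = len(utterance_embeddings)
    dim = len(utterance_embeddings[0])
    mean = [sum(e[i] for e in utterance_embeddings) / count for i in range(dim)]
    return l2_normalize(mean)


def condition_encoder_outputs(encoder_outputs, style_vec):
    """Concatenate the same style vector to every encoder time step.

    Assumption: concatenation is the injection method; the paper's
    abstract does not specify this.
    """
    return [list(frame) + list(style_vec) for frame in encoder_outputs]


# Toy stand-ins for SV-model outputs on two Lombard utterances.
lombard_utterances = [[0.9, 0.1, 0.0], [1.1, -0.1, 0.0]]
style = style_embedding(lombard_utterances)

# Toy stand-ins for text-encoder outputs: 2 time steps, 2 features each.
encoder_outputs = [[0.2, 0.4], [0.6, 0.8]]
conditioned = condition_encoder_outputs(encoder_outputs, style)
```

Under these assumptions, switching the output style at synthesis time amounts to swapping the style vector while the input text and the rest of the model stay the same.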

