ASAI, Aug 13, 2021

Enhancing audio quality for expressive Neural Text-to-Speech

arXiv:2108.06270v1
6 citations
Originality: Incremental advance
AI Analysis

This paper addresses the trade-off between expressiveness and audio quality in TTS, which matters for applications that require realistic expressive speech. The contribution is incremental, as it builds on existing methods.

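The paper tackles the problem of maintaining high signal quality in expressive neural text-to-speech systems. It shows that combining three techniques (tuning the granularity of the autoregressive loop during training, using GANs in acoustic modelling, and using VAEs in both the acoustic model and the neural vocoder) closed the naturalness gap between the baseline system and recordings by 39% in MUSHRA scores for an expressive celebrity voice.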

Artificial speech synthesis has made a great leap in terms of naturalness as recent Text-to-Speech (TTS) systems are capable of producing speech with similar quality to human recordings. However, not all speaking styles are easy to model: highly expressive voices are still challenging even to recent TTS architectures since there seems to be a trade-off between expressiveness in a generated audio and its signal quality. In this paper, we present a set of techniques that can be leveraged to enhance the signal quality of a highly-expressive voice without the use of additional data. The proposed techniques include: tuning the autoregressive loop's granularity during training; using Generative Adversarial Networks in acoustic modelling; and the use of Variational Auto-Encoders in both the acoustic model and the neural vocoder. We show that, when combined, these techniques greatly closed the gap in perceived naturalness between the baseline system and recordings by 39% in terms of MUSHRA scores for an expressive celebrity voice.
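To make the "closed the gap by 39%" figure concrete, here is a minimal sketch of how a relative gap closure could be computed from mean MUSHRA scores. It assumes the gap is measured as the difference between the recordings and the baseline, and that the reported figure is the share of that difference recovered by the improved system; the abstract does not spell out the exact formula. The function name `gap_closed` and all scores below are hypothetical and are not taken from the paper.

```python
def gap_closed(baseline: float, system: float, reference: float) -> float:
    """Return the fraction of the baseline-to-reference gap recovered by `system`.

    Assumed definition (not confirmed by the abstract):
        (system - baseline) / (reference - baseline)
    """
    gap = reference - baseline
    if gap <= 0:
        raise ValueError("reference must score higher than baseline")
    return (system - baseline) / gap


# Hypothetical mean MUSHRA scores on the 0-100 scale, chosen only for illustration.
baseline_score = 60.0    # baseline TTS system
system_score = 67.8      # system with the combined techniques
recordings_score = 80.0  # human recordings

fraction = gap_closed(baseline_score, system_score, recordings_score)
print(f"Gap closed: {fraction:.0%}")
```

Under this definition, a value of 0 means no improvement over the baseline and a value of 1 means the system is rated on par with recordings.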
