CL / SD / AS, Apr 6, 2018

Expressive Speech Synthesis via Modeling Expressions with Variational Autoencoder

arXiv:1804.02135v3, 145 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of expressive speech synthesis for applications requiring nuanced vocal output, though it is incremental as it builds on existing models.

The paper tackled the problem of making neural autoregressive speech synthesis more expressive by modeling global characteristics like speaker individuality and speaking style without labels, and found that combining VoiceLoop with a Variational Autoencoder improved speech quality and enabled unsupervised control of expressions.

Recent advances in neural autoregressive models have improved the performance of speech synthesis (SS). However, as they lack the ability to model global characteristics of speech (such as speaker individualities or speaking styles), particularly when these characteristics have not been labeled, making neural autoregressive SS systems more expressive is still an open issue. In this paper, we propose to combine VoiceLoop, an autoregressive SS model, with a Variational Autoencoder (VAE). This approach, unlike traditional autoregressive SS systems, uses the VAE to model the global characteristics explicitly, enabling the expressiveness of the synthesized speech to be controlled in an unsupervised manner. Experiments using the VCTK and Blizzard2012 datasets show that the VAE helps VoiceLoop to generate higher quality speech and to control the expressions in its synthesized speech by incorporating global characteristics into the speech generating process.
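
To make the idea of an utterance-level latent variable concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the abstract does not say how the latent is fed into VoiceLoop, so conditioning by concatenation is an assumption, and the names `sample_latent`, `kl_to_standard_normal`, and `condition_frames` are hypothetical. The sketch shows only the two standard VAE ingredients (reparameterized sampling from a diagonal Gaussian posterior and the closed-form KL term against a standard normal prior), plus one simple way a single global vector could be shared across every frame of an utterance.

```python
import math
import random

def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def condition_frames(frames, z):
    """Append the same utterance-level latent z to every frame-level input.

    Assumption: concatenation is used here only for illustration; the
    abstract does not specify the conditioning mechanism.
    """
    return [frame + z for frame in frames]

rng = random.Random(0)
mu, log_var = [0.0, 0.0], [0.0, 0.0]            # hypothetical encoder outputs
z = sample_latent(mu, log_var, rng)             # one global vector per utterance
frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]   # toy frame-level decoder inputs
conditioned = condition_frames(frames, z)
```

Because `z` is sampled once per utterance and reused for every frame, it can only carry information that is constant over the utterance, which is the sense in which such a variable captures "global characteristics". At synthesis time, choosing or interpolating `z` by hand is one way expression could be varied without labels.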

