AS LG SD, Jun 13, 2019

Adjusting Pleasure-Arousal-Dominance for Continuous Emotional Text-to-speech Synthesizer

arXiv:1906.05507v1, 12 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for more nuanced emotional text-to-speech synthesis, offering a domain-specific improvement for applications requiring varied emotional expressions.

The paper addresses emotional speech synthesis without a fixed set of emotion categories by conditioning on continuous Pleasure-Arousal-Dominance (PAD) dimensions instead of discrete emotion labels. The result is a Tacotron-based neural synthesizer whose architecture for injecting the PAD values was chosen empirically and whose PAD values are adjusted for speech synthesis.

Emotion is not limited to discrete categories of happy, sad, angry, fear, disgust, surprise, and so on. Instead, each emotion category is projected into a set of nearly independent dimensions, named pleasure (or valence), arousal, and dominance, known as PAD. The value of each dimension varies from -1 to 1, such that the neutral emotion is in the center with all-zero values. Training an emotional continuous text-to-speech (TTS) synthesizer on the independent dimensions provides the possibility of emotional speech synthesis with unlimited emotion categories. Our end-to-end neural speech synthesizer is based on the well-known Tacotron. Empirically, we have found the optimum network architecture for injecting the 3D PADs. Moreover, the PAD values are adjusted for the speech synthesis purpose.
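The PAD representation described above can be illustrated with a minimal Python sketch. It only shows the data layout: three values in [-1, 1] with neutral at the origin, and how a continuous space allows points between any two emotions. The names `PAD`, `interpolate`, and `condition_frames` are hypothetical, and appending the PAD vector to each frame is one assumed way of injecting it, not the architecture the authors found to be optimal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PAD:
    """A point in pleasure-arousal-dominance space; each value lies in [-1, 1]."""
    pleasure: float = 0.0
    arousal: float = 0.0
    dominance: float = 0.0

    def __post_init__(self):
        # Reject values outside the range stated in the abstract.
        for name in ("pleasure", "arousal", "dominance"):
            value = getattr(self, name)
            if not -1.0 <= value <= 1.0:
                raise ValueError(f"{name} must lie in [-1, 1], got {value}")

    def as_vector(self):
        return [self.pleasure, self.arousal, self.dominance]


# The neutral emotion sits at the center, with all-zero values.
NEUTRAL = PAD()


def interpolate(a, b, t):
    """Linearly blend two PAD points; t=0 gives a, t=1 gives b."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return PAD(*(x + t * (y - x) for x, y in zip(a.as_vector(), b.as_vector())))


def condition_frames(frames, pad):
    """Append the 3D PAD vector to every frame.

    Assumption: concatenation is used here only for illustration; the source
    does not say how the PAD values are injected into the network.
    """
    return [list(frame) + pad.as_vector() for frame in frames]


if __name__ == "__main__":
    # Arbitrary illustrative target, not a value taken from the paper.
    target = PAD(pleasure=1.0, arousal=-1.0, dominance=0.5)
    halfway = interpolate(NEUTRAL, target, 0.5)
    print(halfway.as_vector())
    print(condition_frames([[0.1, 0.2]], halfway))
```

Because any point in the cube is a valid input, intermediate emotions such as `halfway` need no label of their own, which is the sense in which the number of emotion categories is unlimited.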

