SD | CL | AS | Sep 21, 2023

A Discourse-level Multi-scale Prosodic Model for Fine-grained Emotion Analysis

arXiv:2309.11849v1 | 3 citations | h-index: 40
Originality: Incremental advance
AI Analysis

This work addresses the problem of generating expressive speech for applications like audiobooks, though it is incremental as it builds on existing style transfer and prosodic modeling techniques.

The paper tackles predicting prosodic features for fine-grained emotion analysis from discourse-level text. It proposes a model that uses multi-scale text to predict phoneme-level and global prosodic embeddings, and reports that discourse-level text improves coherence and user experience, with the synthesized speech outperforming style transfer from the original speech in some user evaluations.

This paper explores predicting suitable prosodic features for fine-grained emotion analysis from discourse-level text. To obtain fine-grained emotional prosodic features as prediction targets for our model, we extract a phoneme-level Local Prosody Embedding sequence (LPEs) and a Global Style Embedding from the speech with the help of a style transfer model. We propose a Discourse-level Multi-scale text Prosodic Model (D-MPM) that exploits multi-scale text to predict these two prosodic features. The proposed model can be used to better analyze emotional prosodic features and thus guide the speech synthesis model to synthesize more expressive speech. To quantitatively evaluate the proposed model, we contribute a new, large-scale Discourse-level Chinese Audiobook (DCA) dataset with more than 13,000 annotated utterance sequences. Experimental results on the DCA dataset show that the multi-scale text information effectively helps to predict prosodic features, and that the discourse-level text improves both the overall coherence and the user experience. More interestingly, although our target is the synthesis quality of the style transfer model, speech synthesized with the proposed text prosodic analysis model is rated even better than style transfer from the original speech on some user evaluation indicators.
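
To make the two prediction targets concrete, here is a minimal sketch of the input/output shapes the abstract describes: one local prosody embedding per phoneme plus a single global style embedding per utterance, both conditioned on text at more than one scale. It is not the D-MPM architecture. The class name `ToyMultiScaleProsodyPredictor`, the mean pooling, the random untrained linear heads, and all dimensions are assumptions made purely for illustration.

```python
import random
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class ProsodyPrediction:
    lpes: List[Vector]  # one local prosody embedding per phoneme
    gse: Vector         # one global style embedding for the utterance


def _linear(in_dim: int, out_dim: int, rng: random.Random) -> List[Vector]:
    # Random, untrained weights: a stand-in for a learned projection.
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def _apply(weights: List[Vector], vec: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def _mean(vectors: List[Vector]) -> Vector:
    return [sum(col) / len(vectors) for col in zip(*vectors)]


class ToyMultiScaleProsodyPredictor:
    """Hypothetical toy model; it only illustrates the interface."""

    def __init__(self, text_dim: int = 8, lpe_dim: int = 4, gse_dim: int = 6, seed: int = 0):
        rng = random.Random(seed)
        # Local head sees phoneme + utterance + discourse features (assumed design).
        self.local_head = _linear(3 * text_dim, lpe_dim, rng)
        # Global head sees utterance + discourse features (assumed design).
        self.global_head = _linear(2 * text_dim, gse_dim, rng)

    def predict(self, phoneme_feats: List[Vector], context_feats: List[Vector]) -> ProsodyPrediction:
        utterance = _mean(phoneme_feats)   # utterance-scale summary
        discourse = _mean(context_feats)   # discourse-scale summary of neighbouring sentences
        lpes = [_apply(self.local_head, p + utterance + discourse) for p in phoneme_feats]
        gse = _apply(self.global_head, utterance + discourse)
        return ProsodyPrediction(lpes=lpes, gse=gse)


if __name__ == "__main__":
    rng = random.Random(1)
    phonemes = [[rng.random() for _ in range(8)] for _ in range(5)]  # 5 phonemes
    context = [[rng.random() for _ in range(8)] for _ in range(3)]   # 3 context sentences
    pred = ToyMultiScaleProsodyPredictor().predict(phonemes, context)
    print(len(pred.lpes), len(pred.lpes[0]), len(pred.gse))  # 5 4 6
```

The point of the sketch is only that the local output grows with the number of phonemes while the global output has a fixed size; in the paper, both targets are extracted from speech by a style transfer model and predicted from text.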

