SD AI MM AS
Jun 27, 2023

CASEIN: Cascading Explicit and Implicit Control for Fine-grained Emotion Intensity Regulation

Yuhao Cui, Xiongwei Wang, Zhongzhou Zhao, Wei Zhou, Haiqing Chen

arXiv:2307.00020v12.32 citations h-index: 19

Originality: Incremental advance

AI Analysis

This work addresses a specific bottleneck in speech synthesis for applications requiring precise emotional expression, representing an incremental advancement in controllability.

The paper targets inaccurate and unsmooth emotion probability predictions in fine-grained intensity regulation for speech synthesis, which reduce controllability and naturalness. It proposes CASEIN, a framework that disentangles emotion manifolds from reference speech to learn implicit representations. The authors report improved performance over existing methods and mixed intensity control for multiple emotions.

Existing fine-grained intensity regulation methods rely on explicit control through predicted emotion probabilities. However, these high-level semantic probabilities are often inaccurate and unsmooth at the phoneme level, leading to bias in learning. The problem is especially pronounced when we attempt to mix multiple emotion intensities for specific phonemes, which markedly reduces the controllability and naturalness of the synthesis. To address this issue, we propose the CAScaded Explicit and Implicit coNtrol framework (CASEIN), which leverages accurate disentanglement of emotion manifolds from the reference speech to learn the implicit representation at a lower semantic level. This representation bridges the semantic gap between explicit probabilities and the synthesis model, reducing bias in learning. In experiments, CASEIN surpasses existing methods in both controllability and naturalness. Notably, we are the first to achieve fine-grained control over the mixed intensity of multiple emotions.
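
To make "mixed intensity of multiple emotions for specific phonemes" concrete, the sketch below builds a per-phoneme table of emotion intensities of the kind an explicit control signal might carry. It is only an illustration under stated assumptions: the emotion names, the span format, the [0, 1] intensity range, and the function `mixed_intensity_targets` are all hypothetical and do not come from the paper. The implicit representation that CASEIN learns from reference speech is not modeled here.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical emotion inventory; the abstract does not list the emotions used.
EMOTIONS: Tuple[str, ...] = ("happy", "sad", "angry")


def mixed_intensity_targets(
    num_phonemes: int,
    spans: Sequence[Tuple[int, int, Dict[str, float]]],
    emotions: Sequence[str] = EMOTIONS,
) -> List[List[float]]:
    """Return one intensity vector per phoneme: [num_phonemes][len(emotions)].

    Each span is (start, end, {emotion: intensity}) with `end` exclusive.
    Intensities are assumed to lie in [0, 1]; phonemes outside every span
    keep an all-zero vector.
    """
    index = {name: i for i, name in enumerate(emotions)}
    targets = [[0.0] * len(emotions) for _ in range(num_phonemes)]
    for start, end, mix in spans:
        if not 0 <= start <= end <= num_phonemes:
            raise ValueError(f"span ({start}, {end}) is out of range")
        for name, value in mix.items():
            if name not in index:
                raise ValueError(f"unknown emotion: {name}")
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"intensity must be in [0, 1]: {value}")
            # Write this emotion's intensity for every phoneme in the span.
            for p in range(start, end):
                targets[p][index[name]] = value
    return targets


# Phonemes 1-2 mix two emotions; phonemes 3-4 carry a single emotion.
targets = mixed_intensity_targets(
    num_phonemes=5,
    spans=[(1, 3, {"happy": 0.8, "sad": 0.2}), (3, 5, {"angry": 0.5})],
)
```

In this toy layout each row is the control vector for one phoneme, so a row with more than one nonzero entry corresponds to a mixed-emotion request for that phoneme.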
