SD | AI | AS | Nov 4, 2024

EmoSphere++: Emotion-Controllable Zero-Shot Text-to-Speech via Emotion-Adaptive Spherical Vector

arXiv:2411.02625v2 | 32 citations | h-index: 6 | IEEE Transactions on Affective Computing
Originality: Highly original
AI Analysis

This work addresses the challenge of generalizing emotional text-to-speech across speakers and styles without human annotation, proposing a novel method for a known bottleneck in the field.

The paper tackles the problem of generating emotional speech without extensive manual annotation. It introduces EmoSphere++, a zero-shot text-to-speech model that controls emotional style and intensity through a novel emotion-adaptive spherical vector and produces high-quality, expressive speech in a few sampling steps.

Emotional text-to-speech (TTS) technology has achieved significant progress in recent years; however, challenges remain owing to the inherent complexity of emotions and limitations of the available emotional speech datasets and models. Previous studies typically relied on limited emotional speech datasets or required extensive manual annotations, restricting their ability to generalize across different speakers and emotional styles. In this paper, we present EmoSphere++, an emotion-controllable zero-shot TTS model that can control emotional style and intensity to resemble natural human speech. We introduce a novel emotion-adaptive spherical vector that models emotional style and intensity without human annotation. Moreover, we propose a multi-level style encoder that can ensure effective generalization for both seen and unseen speakers. We also introduce additional loss functions to enhance the emotion transfer performance for zero-shot scenarios. We employ a conditional flow matching-based decoder to achieve high-quality and expressive emotional TTS in a few sampling steps. Experimental results demonstrate the effectiveness of the proposed framework.
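To make the idea of a spherical vector that separates emotional style from intensity more concrete, here is a minimal sketch of a Cartesian-to-spherical conversion. It is an illustration under assumptions, not the paper's implementation: the abstract does not state the input space, so the 3-D emotion coordinates, the neutral `center`, and the function name `to_spherical` are all hypothetical. In this reading, the radius would act as intensity and the two angles as style.

```python
import math


def to_spherical(point, center):
    """Convert a 3-D point to (radius, polar, azimuth) relative to a center.

    Assumption (not from the abstract): `point` is a 3-D emotion coordinate
    and `center` is a neutral reference point in the same space.
    """
    dx, dy, dz = (p - c for p, c in zip(point, center))
    radius = math.sqrt(dx * dx + dy * dy + dz * dz)
    if radius == 0.0:
        # At the center the direction is undefined; return zeros by convention.
        return 0.0, 0.0, 0.0
    # Clamp to guard against floating-point drift outside [-1, 1].
    cos_polar = max(-1.0, min(1.0, dz / radius))
    polar = math.acos(cos_polar)    # angle from the z-axis, in [0, pi]
    azimuth = math.atan2(dy, dx)    # angle in the x-y plane, in [-pi, pi]
    return radius, polar, azimuth


# Hypothetical usage: the radius is read as intensity, the angles as style.
intensity, polar, azimuth = to_spherical((0.8, 0.3, 0.6), (0.5, 0.5, 0.5))
```

Under these assumptions, scaling the radius while holding both angles fixed would change intensity without changing style, which is the kind of control the abstract describes.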

Code Implementations: 1 repo