CL, MM | Apr 29, 2024

UMETTS: A Unified Framework for Emotional Text-to-Speech Synthesis with Multimodal Prompts

CMU, UW
arXiv:2404.18398v2 | 9 citations | h-index: 26 | ICASSP
Originality: Highly original
AI Analysis

This paper addresses the challenge of generating emotionally expressive speech for human-computer interaction. It is an incremental advance whose novelty lies in its multimodal integration.

The paper tackles the problem of emotional text-to-speech synthesis by proposing UMETTS, a framework that uses multimodal prompts to improve emotion capture, resulting in significant gains in emotion accuracy and speech naturalness over traditional methods.

Emotional Text-to-Speech (E-TTS) synthesis has garnered significant attention in recent years due to its potential to revolutionize human-computer interaction. However, current E-TTS approaches often struggle to capture the intricacies of human emotions, primarily relying on oversimplified emotional labels or single-modality input. In this paper, we introduce the Unified Multimodal Prompt-Induced Emotional Text-to-Speech System (UMETTS), a novel framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech. The core of UMETTS consists of two key components: the Emotion Prompt Alignment Module (EP-Align) and the Emotion Embedding-Induced TTS Module (EMI-TTS). (1) EP-Align employs contrastive learning to align emotional features across text, audio, and visual modalities, ensuring a coherent fusion of multimodal information. (2) Subsequently, EMI-TTS integrates the aligned emotional embeddings with state-of-the-art TTS models to synthesize speech that accurately reflects the intended emotions. Extensive evaluations show that UMETTS achieves significant improvements in emotion accuracy and speech naturalness, outperforming traditional E-TTS methods on both objective and subjective metrics.
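The abstract says that EP-Align uses contrastive learning to align emotional features across text, audio, and visual modalities, but it does not give the objective. The following is a minimal sketch of one common choice, a CLIP-style symmetric InfoNCE loss averaged over every pair of modalities. The loss form, the temperature value, the function names (`l2_normalize`, `info_nce`, `alignment_loss`), and the toy 2-D embeddings are all assumptions made for illustration; none of them is taken from the paper or its repository.

```python
import math
from itertools import combinations


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(anchors, targets, temperature=0.1):
    """Mean cross-entropy of matching anchors[i] to targets[i] against all targets."""
    anchors = [l2_normalize(a) for a in anchors]
    targets = [l2_normalize(t) for t in targets]
    total = 0.0
    for i, a in enumerate(anchors):
        # Temperature-scaled cosine similarity of this anchor to every target.
        logits = [sum(x * y for x, y in zip(a, t)) / temperature for t in targets]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def alignment_loss(batch, temperature=0.1):
    """Average symmetric InfoNCE over every pair of modalities in the batch.

    Assumed (hypothetical) objective, not the paper's stated loss.
    """
    pairs = list(combinations(sorted(batch), 2))
    loss = 0.0
    for a, b in pairs:
        loss += 0.5 * (info_nce(batch[a], batch[b], temperature)
                       + info_nce(batch[b], batch[a], temperature))
    return loss / len(pairs)


# Toy batch: two samples (say "happy" and "sad"), one 2-D embedding per modality.
aligned = {
    "text":   [[1.0, 0.0], [0.0, 1.0]],
    "audio":  [[0.9, 0.1], [0.1, 0.9]],
    "visual": [[0.8, 0.2], [0.2, 0.8]],
}
# Same embeddings, but the audio rows are swapped so the pairs no longer match.
shuffled = dict(aligned, audio=aligned["audio"][::-1])

aligned_loss = alignment_loss(aligned)
shuffled_loss = alignment_loss(shuffled)
```

In this toy setup the loss is lower when the embeddings of the same sample agree across modalities than when one modality is shuffled, which is the behaviour an alignment objective of this kind is meant to reward. In a real system the embeddings would come from trained modality encoders, and the aligned embedding would then condition the TTS model.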

Code Implementations: 1 repo
