SD CL AS · Jan 10, 2025

PROEMO: Prompt-Driven Text-to-Speech Synthesis Based on Emotion and Intensity Control

arXiv:2501.06276v1 · 3 citations · h-index: 77
Originality: Incremental advance
AI Analysis

This paper addresses the problem of generating expressive and variable speech for applications such as virtual assistants and entertainment, but it appears incremental because it builds on existing prompt-based and LLM-based methods.

The paper tackles the challenge of capturing emotion and style nuances in text-to-speech synthesis. It introduces a prompt-driven approach to emotion and intensity control across multiple speakers, uses large language models to manipulate prosody while preserving linguistic content, and demonstrates the approach through a systematic exploration of these control mechanisms.

Speech synthesis has advanced significantly from statistical methods to deep neural network architectures, leading to various text-to-speech (TTS) models that closely mimic human speech patterns. However, capturing nuances such as emotion and style in speech synthesis remains challenging. To address this challenge, we introduce an approach centered on prompt-based emotion control. The proposed architecture incorporates emotion and intensity control across multiple speakers. Furthermore, we leverage large language models (LLMs) to manipulate speech prosody while preserving linguistic content. By embedding emotional cues, regulating intensity levels, and guiding prosodic variations with prompts, our approach infuses synthesized speech with human-like expressiveness and variability. Lastly, we demonstrate the effectiveness of our approach through a systematic exploration of the control mechanisms mentioned above.
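
The abstract does not specify how emotion and intensity are combined, so the following is only a minimal sketch of one common way such conditioning could work, not the paper's architecture. It assumes that each emotion and each speaker has an embedding vector, and that intensity in [0, 1] interpolates between a neutral embedding and the target emotion embedding. All names (`EMOTION_EMBEDDINGS`, `SPEAKER_EMBEDDINGS`, `conditioning_vector`) and all numeric values are hypothetical toy placeholders.

```python
# Toy, hand-written embeddings (assumption): a real system would learn these.
EMOTION_EMBEDDINGS = {
    "neutral": [0.0, 0.0, 0.0],
    "happy": [1.0, 0.5, 0.0],
    "sad": [-1.0, 0.0, 0.5],
}

SPEAKER_EMBEDDINGS = {
    "spk_a": [0.2, 0.1, 0.0],
    "spk_b": [0.0, 0.3, 0.1],
}


def conditioning_vector(speaker: str, emotion: str, intensity: float) -> list[float]:
    """Combine a speaker embedding with an intensity-scaled emotion embedding.

    Assumption: intensity 0.0 yields the neutral emotion, 1.0 the full
    target emotion, with linear interpolation in between.
    """
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must lie in [0, 1]")

    spk = SPEAKER_EMBEDDINGS[speaker]
    emo = EMOTION_EMBEDDINGS[emotion]
    neutral = EMOTION_EMBEDDINGS["neutral"]

    # Interpolate from neutral toward the target emotion.
    scaled = [n + intensity * (e - n) for n, e in zip(neutral, emo)]

    # Add the speaker identity so the same emotion can be applied to any speaker.
    return [s + x for s, x in zip(spk, scaled)]


if __name__ == "__main__":
    for level in (0.0, 0.5, 1.0):
        print(level, conditioning_vector("spk_a", "happy", level))
```

In a sketch like this, the resulting vector would be passed to the acoustic model as conditioning input; keeping speaker and emotion as separate terms is what lets one emotion setting apply across multiple speakers.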
