CL | SD | AS | Jun 10, 2024

Controlling Emotion in Text-to-Speech with Natural Language Prompts

arXiv:2406.06406v2 | 15 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for intuitive emotional control in speech synthesis, though it is incremental as it builds on existing prompting and transformer-based methods.

The paper tackles the problem of controlling emotion in text-to-speech synthesis by using natural language prompts, achieving accurate emotion transfer while maintaining speaker identity, speech quality, and intelligibility.

In recent years, prompting has quickly become one of the standard ways of steering the outputs of generative machine learning models, due to its intuitive use of natural language. In this work, we propose a system conditioned on embeddings derived from an emotionally rich text that serves as prompt. Thereby, a joint representation of speaker and prompt embeddings is integrated at several points within a transformer-based architecture. Our approach is trained on merged emotional speech and text datasets and varies prompts in each training iteration to increase the generalization capabilities of the model. Objective and subjective evaluation results demonstrate the ability of the conditioned synthesis system to accurately transfer the emotions present in a prompt to speech. At the same time, precise tractability of speaker identities as well as overall high speech quality and intelligibility are maintained.
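The abstract notes that prompts are varied in each training iteration to improve generalization. A minimal sketch of that idea follows, assuming prompts are grouped by an emotion label that matches the label of each speech sample. The prompt pool, the utterance IDs, and the function names are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import random

# Hypothetical prompt pool grouped by emotion label (illustrative data, not from the paper).
PROMPTS_BY_EMOTION = {
    "joy": ["What a wonderful surprise!", "I can't stop smiling today."],
    "anger": ["This is completely unacceptable.", "Stop ignoring what I said!"],
    "sadness": ["I miss how things used to be.", "Nothing feels right anymore."],
}

def sample_prompt(emotion, rng):
    """Pick a random prompt whose emotion label matches the speech sample (assumed pairing rule)."""
    return rng.choice(PROMPTS_BY_EMOTION[emotion])

def training_pairs(samples, epochs, seed=0):
    """Yield (epoch, utterance_id, prompt), redrawing the prompt every time a sample is visited."""
    rng = random.Random(seed)
    for epoch in range(epochs):
        for utterance_id, emotion in samples:
            yield epoch, utterance_id, sample_prompt(emotion, rng)

# Two hypothetical labelled utterances, visited over three epochs.
samples = [("utt_001", "joy"), ("utt_002", "anger")]
pairs = list(training_pairs(samples, epochs=3))
```

Because the prompt is redrawn on every visit, the same utterance can be paired with different prompt texts across epochs, so the model is not tied to one fixed sentence per sample.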

Code Implementations: 1 repo
