CL SD AS · Jun 10, 2024

Controlling Emotion in Text-to-Speech with Natural Language Prompts

arXiv:2406.06406v26.615 citations · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for intuitive emotional control in speech synthesis, though it is incremental as it builds on existing prompting and transformer-based methods.

The paper tackles the problem of controlling emotion in text-to-speech synthesis by using natural language prompts, achieving accurate emotion transfer while maintaining speaker identity, speech quality, and intelligibility.

In recent years, prompting has quickly become one of the standard ways of steering the outputs of generative machine learning models, due to its intuitive use of natural language. In this work, we propose a system conditioned on embeddings derived from an emotionally rich text that serves as prompt. Thereby, a joint representation of speaker and prompt embeddings is integrated at several points within a transformer-based architecture. Our approach is trained on merged emotional speech and text datasets and varies prompts in each training iteration to increase the generalization capabilities of the model. Objective and subjective evaluation results demonstrate the ability of the conditioned synthesis system to accurately transfer the emotions present in a prompt to speech. At the same time, precise tractability of speaker identities as well as overall high speech quality and intelligibility are maintained.
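The abstract's idea of varying prompts in each training iteration can be illustrated with a small sketch. This is not the authors' pipeline: it assumes each speech sample carries an emotion label and that prompts are drawn at random from a pool of texts with the same label. The names `PROMPT_POOL`, `sample_prompt`, and `training_examples`, as well as the example sentences, are hypothetical.

```python
import random

# Hypothetical pool of emotionally rich prompt texts, grouped by emotion label.
# In practice these would come from an emotional text dataset.
PROMPT_POOL = {
    "joy": ["What a wonderful surprise!", "I can't stop smiling today."],
    "anger": ["This is completely unacceptable.", "I have had enough of this!"],
    "sadness": ["I miss them more every day.", "Nothing feels the same anymore."],
}


def sample_prompt(emotion, rng):
    """Pick a random prompt whose emotion label matches the speech sample."""
    return rng.choice(PROMPT_POOL[emotion])


def training_examples(utterances, epochs, seed=0):
    """Yield (epoch, text, emotion, prompt) tuples.

    A fresh prompt is sampled every time an utterance is visited, so the model
    sees the same audio paired with different prompts across iterations
    (assumed here to be what encourages generalization to unseen prompts).
    """
    rng = random.Random(seed)
    for epoch in range(epochs):
        for text, emotion in utterances:
            yield epoch, text, emotion, sample_prompt(emotion, rng)


if __name__ == "__main__":
    data = [("The train leaves at noon.", "joy"), ("Close the door.", "anger")]
    for epoch, text, emotion, prompt in training_examples(data, epochs=3):
        print(epoch, emotion, repr(text), "<-", repr(prompt))
```

In a full system, the sampled prompt would be passed through a text encoder and the resulting embedding combined with a speaker embedding before conditioning the synthesis model. That part is omitted here because the listing does not give enough detail to sketch it faithfully.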
