CL SD AS, May 17, 2023

Controllable Speaking Styles Using a Large Language Model

arXiv:2305.10321v2, 3 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of flexible and intuitive control over TTS outputs for applications like dialogue systems, though it is incremental as it builds on existing prompt-based methods.

The paper tackles the problem of controlling speaking styles in text-to-speech (TTS) without reference utterances or prompt-labelled speech data. It uses a large language model (LLM) to suggest prosodic modifications based on natural language prompts. The proposed method was rated most appropriate in 50% of cases, compared with 31% for a baseline.

Reference-based Text-to-Speech (TTS) models can generate multiple, prosodically different renditions of the same target text. Such models jointly learn a latent acoustic space during training, which can be sampled from during inference. Controlling these models during inference typically requires finding an appropriate reference utterance, which is non-trivial. Large generative language models (LLMs) have shown excellent performance in various language-related tasks. Given only a natural language query text (the prompt), such models can be used to solve specific, context-dependent tasks. Recent work in TTS has attempted similar prompt-based control of novel speaking style generation. Those methods do not require a reference utterance and can, under ideal conditions, be controlled with only a prompt. But existing methods typically require a prompt-labelled speech corpus for jointly training a prompt-conditioned encoder. In contrast, we employ an LLM to directly suggest prosodic modifications for a controllable TTS model, using contextual information provided in the prompt. The prompt can be designed for a multitude of tasks. Here, we give two demonstrations: control of speaking style, and prosody appropriate for a given dialogue context. The proposed method is rated most appropriate in 50% of cases vs. 31% for a baseline model.
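The abstract's core idea (a prompt goes to an LLM, which returns prosodic modifications for a controllable TTS model) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration, not the paper's implementation: the control dimensions (pitch, energy, duration), the JSON reply format, the [-1, 1] range, and the helper names `build_prompt` and `parse_reply` are all hypothetical, and no real LLM or TTS system is called.

```python
import json
from dataclasses import dataclass

# Hypothetical control dimensions; the source only says "prosodic modifications".
CONTROLS = ("pitch", "energy", "duration")


@dataclass
class ProsodyShift:
    """Relative shifts to hand to a controllable TTS model (0.0 means no change)."""
    pitch: float = 0.0
    energy: float = 0.0
    duration: float = 0.0


def build_prompt(text: str, style: str) -> str:
    """Compose a natural-language query asking for prosody shifts (assumed format)."""
    return (
        f"Text to speak: {text!r}\n"
        f"Desired speaking style: {style}\n"
        "Reply with JSON holding the keys pitch, energy and duration, "
        "each a number between -1 and 1."
    )


def parse_reply(reply: str, limit: float = 1.0) -> ProsodyShift:
    """Parse an LLM reply; malformed or missing values fall back to 0.0."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return ProsodyShift()
    if not isinstance(data, dict):
        return ProsodyShift()
    values = {}
    for name in CONTROLS:
        raw = data.get(name, 0.0)
        if isinstance(raw, bool) or not isinstance(raw, (int, float)):
            raw = 0.0
        # Clamp so an over-eager suggestion cannot push the TTS controls out of range.
        values[name] = max(-limit, min(limit, float(raw)))
    return ProsodyShift(**values)


# Stand-in for an LLM reply; no model is queried in this sketch.
fake_reply = '{"pitch": 0.4, "energy": 1.7, "duration": -0.2}'
shift = parse_reply(fake_reply)
print(build_prompt("I can't believe you did that.", "excited"))
print(shift)
```

Clamping and falling back to zero are design choices of this sketch: they keep the TTS controls in a safe range when the LLM reply is out of bounds or not valid JSON.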

