AS · AI · CL · LG · Mar 14, 2023

Controllable Prosody Generation With Partial Inputs

arXiv:2303.09446v2 · 3 citations · h-index: 6
AI Analysis

The paper addresses inefficient and imprecise user control over prosody in generative text-to-speech models and proposes a more interactive alternative.

The paper enables human-in-the-loop control of prosody generation in text-to-speech synthesis through a framework in which users supply partial inputs and the model generates the rest. With only about 4 input values, users improve the output significantly in terms of listener preference (4:1).

We address the problem of human-in-the-loop control for generating prosody in the context of text-to-speech synthesis. Controlling prosody is challenging because existing generative models lack an efficient interface through which users can modify the output quickly and precisely. To solve this, we introduce a novel framework whereby the user provides partial inputs and the generative model generates the missing features. We propose a model that is specifically designed to encode partial prosodic features and output complete audio. We show empirically that our model displays two essential qualities of a human-in-the-loop control mechanism: efficiency and robustness. With even a very small number of input values (~4), our model enables users to improve the quality of the output significantly in terms of listener preference (4:1).
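To make the idea of partial inputs concrete, here is a minimal sketch of one way a sparse set of user-specified prosody values could be represented and completed. It is an illustration only, not the paper's model: the function names `encode_partial_inputs` and `fill_missing`, the value-plus-mask representation, and the per-frame pitch interpretation are all assumptions, and simple linear interpolation stands in for the learned generative model that the abstract describes.

```python
from typing import Dict, List, Tuple


def encode_partial_inputs(
    num_frames: int, user_values: Dict[int, float]
) -> Tuple[List[float], List[int]]:
    """Represent sparse user inputs as (values, mask).

    Hypothetical representation: mask[i] is 1 where the user supplied a
    prosodic value (e.g. a pitch target) and 0 where it is missing.
    """
    values = [0.0] * num_frames
    mask = [0] * num_frames
    for index, value in user_values.items():
        if not 0 <= index < num_frames:
            raise IndexError(f"frame index {index} is out of range")
        values[index] = value
        mask[index] = 1
    return values, mask


def fill_missing(values: List[float], mask: List[int]) -> List[float]:
    """Fill the unspecified frames.

    Placeholder for a generative model: linear interpolation between the
    known frames, holding the first and last known values at the edges.
    """
    known = [i for i, m in enumerate(mask) if m]
    if not known:
        raise ValueError("at least one user-provided value is required")
    filled = list(values)
    # Hold the nearest known value before the first and after the last input.
    for i in range(known[0]):
        filled[i] = values[known[0]]
    for i in range(known[-1] + 1, len(values)):
        filled[i] = values[known[-1]]
    # Interpolate linearly between each pair of neighbouring known frames.
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            t = (i - left) / span
            filled[i] = (1 - t) * values[left] + t * values[right]
    return filled


if __name__ == "__main__":
    # Four user-specified values, echoing the "~4 inputs" setting.
    values, mask = encode_partial_inputs(
        10, {0: 110.0, 3: 180.0, 6: 150.0, 9: 120.0}
    )
    print(mask)
    print(fill_missing(values, mask))
```

The mask lets a downstream model tell a deliberately specified value apart from a missing one, which is the property any partial-input interface needs regardless of how the missing features are ultimately generated.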
