CL, AI, LG, MM, SD, AS | Dec 9, 2024

Towards Controllable Speech Synthesis in the Era of Large Language Models: A Systematic Survey

arXiv:2412.06602v3 | 25 citations | h-index: 5 | Has Code | EMNLP
Originality: Synthesis-oriented
AI Analysis

The survey addresses the need for a clear taxonomy and guidance for researchers and practitioners in the fast-evolving field of controllable text-to-speech. Its contribution is incremental, since it synthesizes existing knowledge.

This survey comprehensively reviews controllable speech synthesis methods and provides the first systematic categorization of techniques, from traditional control approaches to emerging ones that use natural language prompts.

Text-to-speech (TTS) has advanced from generating natural-sounding speech to enabling fine-grained control over attributes like emotion, timbre, and style. Driven by rising industrial demand and breakthroughs in deep learning, e.g., diffusion and large language models (LLMs), controllable TTS has become a rapidly growing research area. This survey provides the first comprehensive review of controllable TTS methods, from traditional control techniques to emerging approaches using natural language prompts. We categorize model architectures, control strategies, and feature representations, while also summarizing challenges, datasets, and evaluations in controllable TTS. This survey aims to guide researchers and practitioners by offering a clear taxonomy and highlighting future directions in this fast-evolving field. One can visit https://github.com/imxtx/awesome-controllabe-speech-synthesis for a comprehensive paper list and updates.

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
