SD · CL · AS · Feb 2, 2024

Natural language guidance of high-fidelity text-to-speech with synthetic annotations

arXiv:2402.01912v1 · 125 citations · h-index: 2
Originality: Incremental advance
AI Analysis

This work addresses the scalability and creative limitations of text-to-speech control in applications that require intuitive user interaction, though it is an incremental advance that builds on existing natural language prompting approaches.

The paper tackles the problem of controlling speaker identity and style in text-to-speech models without relying on reference recordings or human-labeled descriptions. It proposes a scalable method for synthetic annotation and applies it to a 45k hour dataset, which is used to train a speech language model. The resulting model generates high-fidelity speech across diverse accents, styles, and conditions using natural language conditioning.

Text-to-speech models trained on large-scale datasets have demonstrated impressive in-context learning capabilities and naturalness. However, control of speaker identity and style in these models typically requires conditioning on reference speech recordings, limiting creative applications. Alternatively, natural language prompting of speaker identity and style has demonstrated promising results and provides an intuitive method of control. However, reliance on human-labeled descriptions prevents scaling to large datasets. Our work bridges the gap between these two approaches. We propose a scalable method for labeling various aspects of speaker identity, style, and recording conditions. We then apply this method to a 45k hour dataset, which we use to train a speech language model. Furthermore, we propose simple methods for increasing audio fidelity, significantly outperforming recent work despite relying entirely on found data. Our results demonstrate high-fidelity speech generation in a diverse range of accents, prosodic styles, channel conditions, and acoustic conditions, all accomplished with a single model and intuitive natural language conditioning. Audio samples can be heard at https://text-description-to-speech.com/.

Code Implementations: 3 repos
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
