CL, SD, AS | Jan 5, 2023

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Microsoft
arXiv:2301.02111v1 | 1207 citations | h-index: 102
Originality: Highly original
AI Analysis

This work targets personalized speech synthesis for applications such as voice assistants and accessibility tools. Its zero-shot approach is a new formulation rather than an incremental refinement of earlier systems.

The paper treats text-to-speech synthesis as a conditional language modeling task: a neural codec language model (Vall-E) is trained on 60K hours of English speech. Given only a 3-second recording of an unseen speaker, it synthesizes high-quality personalized speech and significantly outperforms the state-of-the-art zero-shot TTS system in naturalness and speaker similarity.

We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called Vall-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech, which is hundreds of times larger than existing systems. Vall-E shows emergent in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experimental results show that Vall-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find that Vall-E can preserve the speaker's emotion and the acoustic environment of the acoustic prompt in synthesis. See https://aka.ms/valle for demos of our work.
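To make the "conditional language modeling" framing concrete, here is a minimal sketch of the idea, not the authors' implementation. It assumes that text arrives as integer phoneme ids and that the acoustic prompt has already been turned into discrete codec codes. The helper names (`build_context`, `toy_next_code_distribution`, `generate_codes`), the separator id, and the codebook size are all hypothetical, and the "model" is a uniform stand-in rather than a trained network.

```python
import random


def build_context(phoneme_ids, prompt_codes, sep_id):
    """Join text tokens and acoustic-prompt codes into one conditioning prefix.

    Hypothetical layout: [phonemes..., separator, prompt codes...].
    """
    return list(phoneme_ids) + [sep_id] + list(prompt_codes)


def toy_next_code_distribution(context, codebook_size):
    """Stand-in for a trained model (assumption): uniform over all codes.

    A real model would return p(next code | context) from a neural network.
    """
    return [1.0 / codebook_size] * codebook_size


def generate_codes(context, num_steps, codebook_size, seed=0):
    """Autoregressively sample discrete codes, one step at a time.

    Each step conditions on the prefix plus every code sampled so far,
    which is what makes this language modeling rather than regression.
    """
    rng = random.Random(seed)
    generated = []
    for _ in range(num_steps):
        probs = toy_next_code_distribution(context + generated, codebook_size)
        code = rng.choices(range(codebook_size), weights=probs, k=1)[0]
        generated.append(code)
    return generated


# Illustrative values only: three phoneme ids and two prompt codes.
context = build_context([3, 1, 4], [10, 20], sep_id=-1)
codes = generate_codes(context, num_steps=5, codebook_size=1024)
print(len(codes), "codes sampled")
```

In a real system the sampled codes would be passed to the codec's decoder to produce a waveform. The sketch stops at the code sequence because the abstract does not specify the decoding step.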

Code Implementations: 7 repos

Data from Papers with Code (CC-BY-SA-4.0)
