CL | AI | SD | AS | Mar 7, 2023

Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling

Microsoft
arXiv:2303.03926v1 | 266 citations | h-index: 102
Originality: Incremental advance
AI Analysis

This paper addresses the problem of natural-sounding cross-lingual speech synthesis for applications such as text-to-speech and speech translation, and represents a strong incremental advance over prior work.

The paper tackles cross-lingual speech synthesis by proposing VALL-E X, a model that generates high-quality speech in a target language using a single source-language speech prompt, preserving the speaker's voice, emotion, and acoustic environment while reducing foreign accents.

We propose a cross-lingual neural codec language model, VALL-E X, for cross-lingual speech synthesis. Specifically, we extend VALL-E and train a multi-lingual conditional codec language model to predict the acoustic token sequences of the target language speech by using both the source language speech and the target language text as prompts. VALL-E X inherits strong in-context learning capabilities and can be applied for zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation tasks. Experimental results show that it can generate high-quality speech in the target language via just one speech utterance in the source language as a prompt while preserving the unseen speaker's voice, emotion, and acoustic environment. Moreover, VALL-E X effectively alleviates the foreign accent problems, which can be controlled by a language ID. Audio samples are available at https://aka.ms/vallex.

Code Implementations: 1 repo
