CL · SD · AS · Oct 21, 2022

Low-Resource Multilingual and Zero-Shot Multispeaker TTS

arXiv:2210.12223v1 · 302 citations · h-index: 38 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the challenge of making TTS feasible for the vast majority of the world's languages, which lack extensive training data, though it is incremental in that it combines existing tasks.

The paper tackles text-to-speech (TTS) for low-resource languages: the proposed system learns to speak a new language from just 5 minutes of training data while retaining zero-shot voice cloning capabilities.

While neural methods for text-to-speech (TTS) have shown great advances in modeling multiple speakers, even in zero-shot settings, the amount of data needed for those approaches is generally not feasible for the vast majority of the world's over 6,000 spoken languages. In this work, we bring together the tasks of zero-shot voice cloning and multilingual low-resource TTS. Using the language-agnostic meta-learning (LAML) procedure and modifications to a TTS encoder, we show that it is possible for a system to learn to speak a new language using just 5 minutes of training data while retaining the ability to infer the voice of even unseen speakers in the newly learned language. We show the success of our proposed approach in terms of intelligibility, naturalness, and similarity to the target speaker using objective metrics as well as human studies, and we release our code and trained models as open source.

Code Implementations: 1 repo
