SD, LG, AS
Jan 24, 2024

Scaling NVIDIA's Multi-speaker Multi-lingual TTS Systems with Zero-Shot TTS to Indic Languages

arXiv:2401.13851v2
1 citation
2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)
Originality: Synthesis-oriented
AI Analysis

This work addresses the challenge of building high-quality TTS systems for Indic languages. Its contribution is incremental: it applies two existing methods, RAD-MMM and P-Flow, to new datasets in the setting of a specific competition.

The paper develops multi-speaker, multi-lingual text-to-speech (TTS) systems for Indic languages. Its zero-shot TTS system ranks first on Track 3 of the challenge, with a mean opinion score (MOS) of 4.4 and a speaker similarity score (SMOS) of 3.62.

In this paper, we describe the TTS models developed by NVIDIA for the MMITS-VC (Multi-speaker, Multi-lingual Indic TTS with Voice Cloning) 2024 Challenge. In Tracks 1 and 2, we utilize RAD-MMM to perform few-shot TTS by training additionally on 5 minutes of target speaker data. In Track 3, we utilize P-Flow to perform zero-shot TTS by training on the challenge dataset as well as external datasets. We use HiFi-GAN vocoders for all submissions. RAD-MMM performs competitively on Tracks 1 and 2, while P-Flow ranks first on Track 3, with a mean opinion score (MOS) of 4.4 and a speaker similarity score (SMOS) of 3.62.

