SD LG AS

Jan 24, 2024

Scaling NVIDIA's Multi-speaker Multi-lingual TTS Systems with Zero-Shot TTS to Indic Languages

Akshit Arora, Rohan Badlani, Sungwon Kim, Rafael Valle, Bryan Catanzaro

arXiv:2401.13851v22.71 citations

2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)

Originality: Synthesis-oriented

AI Analysis

This work addresses the challenge of building high-quality TTS systems for Indic languages. Its contribution is incremental: it applies existing methods, RAD-MMM and P-Flow, to new datasets and a specific competition.

The paper tackles the problem of developing multi-speaker, multi-lingual text-to-speech (TTS) systems for Indic languages. RAD-MMM performs competitively on the few-shot tracks (Tracks 1 and 2), and P-Flow ranks first on the zero-shot track (Track 3) with a mean opinion score (MOS) of 4.4 and a speaker similarity score (SMOS) of 3.62.

In this paper, we describe the TTS models developed by NVIDIA for the MMITS-VC (Multi-speaker, Multi-lingual Indic TTS with Voice Cloning) 2024 Challenge. In Tracks 1 and 2, we utilize RAD-MMM to perform few-shot TTS by training additionally on 5 minutes of target speaker data. In Track 3, we utilize P-Flow to perform zero-shot TTS by training on the challenge dataset as well as external datasets. We use HiFi-GAN vocoders for all submissions. RAD-MMM performs competitively on Tracks 1 and 2, while P-Flow ranks first on Track 3, with a mean opinion score (MOS) of 4.4 and a speaker similarity score (SMOS) of 3.62.
