Using IPA-Based Tacotron for Data Efficient Cross-Lingual Speaker Adaptation and Pronunciation Enhancement
This work addresses data efficiency in cross-lingual speaker adaptation for TTS applications, though the contribution is incremental because it builds on existing Tacotron models.
The paper tackles the problem of fine-tuning neural Text-to-Speech models for new speakers or languages with limited data, achieving effective adaptation using only 20 minutes of data through minor modifications to a Tacotron model.
Recent neural Text-to-Speech (TTS) models have been shown to perform very well when enough data is available. However, fine-tuning them for new speakers or languages is not straightforward in a low-resource setting. In this paper, we show that by applying minor modifications to a Tacotron model, one can transfer an existing TTS model to new speakers of the same or a different language using only 20 minutes of data. For this purpose, we first introduce a base multi-lingual Tacotron with language-agnostic input, then demonstrate how transfer learning is carried out in different speaker adaptation scenarios without exploiting any pre-trained speaker encoder or code-switching technique. We evaluate the transferred model both subjectively and objectively.
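To make the idea of language-agnostic input concrete, the following is a minimal sketch of how text from two languages could be mapped onto one shared IPA symbol inventory before being fed to a single model. It is not the authors' front end: the toy lexicons, the function names (`to_ipa`, `build_symbol_table`, `encode`), and the phoneme choices are all hypothetical, and a real system would rely on a proper grapheme-to-phoneme converter.

```python
# Hypothetical toy lexicons: word -> IPA phoneme sequence.
# These entries are illustrative assumptions, not data from the paper.
TOY_LEXICONS = {
    "en": {"hello": ["h", "ə", "l", "oʊ"], "two": ["t", "uː"]},
    "de": {"hallo": ["h", "a", "l", "oː"], "tee": ["t", "eː"]},
}


def to_ipa(text, lang):
    """Convert whitespace-separated words into a flat list of IPA symbols."""
    lexicon = TOY_LEXICONS[lang]
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(lexicon[word])  # raises KeyError for unknown words
        phonemes.append(" ")            # word boundary marker
    return phonemes[:-1]                # drop the trailing boundary


def build_symbol_table(lexicons):
    """Build one symbol -> integer ID table shared by all languages."""
    symbols = {" "}
    for lexicon in lexicons.values():
        for phones in lexicon.values():
            symbols.update(phones)
    return {sym: idx for idx, sym in enumerate(sorted(symbols))}


SYMBOL_TO_ID = build_symbol_table(TOY_LEXICONS)


def encode(text, lang):
    """Map text to the integer IDs a model's embedding layer would consume."""
    return [SYMBOL_TO_ID[p] for p in to_ipa(text, lang)]


if __name__ == "__main__":
    print(to_ipa("hello two", "en"))
    print(encode("hello", "en"))
    print(encode("hallo", "de"))
```

Because both languages draw from the same table, phonemes they share (here `h` and `l`) receive identical IDs, which is what would let a single embedding layer be reused when adapting to a speaker of another language.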