SD CL LG AS Nov 17, 2021

Cross-lingual Low Resource Speaker Adaptation Using Phonological Features

Georgia Maniati, Nikolaos Ellinas, Konstantinos Markopoulos, Georgios Vamvoukakis, June Sig Sung, Hyoungmin Park, Aimilios Chalamandaris, Pirros Tsiakoulis

arXiv:2111.09075v1 10.816 citations

Originality: Incremental advance

AI Analysis

This paper addresses low-resource speaker adaptation for multilingual TTS, preserving a speaker's voice with minimal data, though it is an incremental advance over prior phonological feature methods.

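The paper tackles cross-lingual speaker adaptation in text-to-speech using phonological features. With 32 or 8 adaptation utterances it reports high speaker similarity and naturalness comparable to the literature, and with only 2 utterances the model behaves as a few-shot learner, performing similarly whether the adaptation language is seen or unseen.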

The idea of using phonological features instead of phonemes as input to sequence-to-sequence TTS has been recently proposed for zero-shot multilingual speech synthesis. This approach is useful for code-switching, as it facilitates the seamless uttering of foreign text embedded in a stream of native text. In our work, we train a language-agnostic multispeaker model conditioned on a set of phonologically derived features common across different languages, with the goal of achieving cross-lingual speaker adaptation. We first experiment with the effect of language phonological similarity on cross-lingual TTS of several source-target language combinations. Subsequently, we fine-tune the model with very limited data of a new speaker's voice in either a seen or an unseen language, and achieve synthetic speech of equal quality, while preserving the target speaker's identity. With as few as 32 and 8 utterances of target speaker data, we obtain high speaker similarity scores and naturalness comparable to the corresponding literature. In the extreme case of only 2 available adaptation utterances, we find that our model behaves as a few-shot learner, as the performance is similar in both the seen and unseen adaptation language scenarios.
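To make the core idea concrete, the sketch below shows one way a phoneme sequence could be turned into phonological feature vectors before being fed to a sequence-to-sequence TTS encoder. The feature inventory, the phone table, and the function names (`FEATURES`, `PHONE_FEATURES`, `encode`, `encode_sequence`) are all hypothetical simplifications chosen for illustration; the abstract does not specify the feature set the authors actually use.

```python
# Illustrative only: a tiny, hypothetical phonological feature inventory.
# The paper's real feature set is not given in the abstract.
FEATURES = [
    "vowel", "consonant", "voiced", "nasal", "plosive", "fricative",
    "bilabial", "alveolar", "velar", "high", "low", "front", "back", "rounded",
]

# Hypothetical phone-to-feature table (IPA-like symbols, simplified).
PHONE_FEATURES = {
    "p": {"consonant", "plosive", "bilabial"},
    "b": {"consonant", "plosive", "bilabial", "voiced"},
    "m": {"consonant", "nasal", "bilabial", "voiced"},
    "t": {"consonant", "plosive", "alveolar"},
    "s": {"consonant", "fricative", "alveolar"},
    "k": {"consonant", "plosive", "velar"},
    "i": {"vowel", "voiced", "high", "front"},
    "a": {"vowel", "voiced", "low"},
    "u": {"vowel", "voiced", "high", "back", "rounded"},
    "y": {"vowel", "voiced", "high", "front", "rounded"},
}


def encode(phone):
    """Return a binary feature vector for one phone, ordered as FEATURES."""
    active = PHONE_FEATURES[phone]
    return [1 if feature in active else 0 for feature in FEATURES]


def encode_sequence(phones):
    """Encode a list of phones as a list of binary feature vectors."""
    return [encode(phone) for phone in phones]


if __name__ == "__main__":
    for phone, vector in zip(["m", "a", "y"], encode_sequence(["m", "a", "y"])):
        print(phone, vector)
```

In this toy table, the front rounded vowel `y` activates only features that also appear in `i` or `u`. That is the intuition behind feature-based input: a phone missing from the training languages can still be described with feature dimensions the model has already seen, which a one-hot phoneme ID cannot offer.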
