SD · AS · Aug 2, 2021

Speaker Adaptation with Continuous Vocoder-based DNN-TTS

arXiv:2108.01154v1 · 4 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses speaker adaptation for TTS applications requiring real-time synthesis, but it is incremental as it builds on existing vocoder methods.

The paper tackles speaker adaptation in text-to-speech synthesis with a continuous vocoder, chosen for its low computational complexity. It shows that adaptation is feasible with 400 utterances (about 14 minutes) and that objective quality is similar to a WORLD vocoder-based baseline.

Traditional vocoder-based statistical parametric speech synthesis can be advantageous in applications that require low computational complexity. Recent neural vocoders, which can produce high naturalness, still cannot fulfill the requirement of being real-time during synthesis. In this paper, we experiment with our earlier continuous vocoder, in which the excitation is modeled with two one-dimensional parameters: continuous F0 and Maximum Voiced Frequency. We show on the data of 9 speakers that an average voice can be trained for DNN-TTS, and that speaker adaptation is feasible with 400 utterances (about 14 minutes). Objective experiments support that the quality of speaker adaptation with Continuous Vocoder-based DNN-TTS is similar to the quality of speaker adaptation with a WORLD Vocoder-based baseline.
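To give a sense of scale for the adaptation data, the arithmetic behind "400 utterances (about 14 minutes)" can be sketched as below. The utterance count, the duration, and the two excitation parameters per frame come from the abstract. The 5 ms frame shift and the function name `adaptation_budget` are assumptions made only for illustration; the abstract does not state the frame shift.

```python
def adaptation_budget(n_utterances=400, total_minutes=14.0, frame_shift_ms=5.0):
    """Rough size of the adaptation set described in the abstract.

    frame_shift_ms is an ASSUMED analysis frame shift; it is not given
    in the abstract and is used here only for illustration.
    """
    total_seconds = total_minutes * 60.0
    # Average utterance length implied by the two figures in the abstract.
    mean_utterance_s = total_seconds / n_utterances
    # Number of analysis frames at the assumed frame shift.
    n_frames = int(round(total_seconds * 1000.0 / frame_shift_ms))
    # The excitation is modeled with two one-dimensional parameters per
    # frame: continuous F0 and Maximum Voiced Frequency.
    excitation_values = 2 * n_frames
    return {
        "mean_utterance_s": mean_utterance_s,
        "n_frames": n_frames,
        "excitation_values": excitation_values,
    }


if __name__ == "__main__":
    for key, value in adaptation_budget().items():
        print(f"{key}: {value}")
```

Under these assumptions the adaptation utterances average roughly two seconds each, so the adaptation set is small compared with typical single-speaker TTS corpora.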

