CL, LG, SD, AS | Mar 26, 2021

Continual Speaker Adaptation for Text-to-Speech Synthesis

arXiv:2103.14512v2 | 9 citations
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of adding new speakers to a TTS model efficiently, without retraining it from scratch. This is relevant for developers and users of speech synthesis systems.

The paper tackles catastrophic forgetting in multi-speaker Text-to-Speech synthesis when new speakers are added. It shows that serial fine-tuning degrades performance for older speakers and applies continual learning methods to mitigate this. It also demonstrates improvements in extreme setups where only very small replay buffers are available.

Training a multi-speaker Text-to-Speech (TTS) model from scratch is computationally expensive and adding new speakers to the dataset requires the model to be re-trained. The naive solution of sequential fine-tuning of a model for new speakers can lead to poor performance of older speakers. This phenomenon is known as catastrophic forgetting. In this paper, we look at TTS modeling from a continual learning perspective, where the goal is to add new speakers without forgetting previous speakers. Therefore, we first propose an experimental setup and show that serial fine-tuning for new speakers can cause the forgetting of the earlier speakers. Then we exploit two well-known techniques for continual learning, namely experience replay and weight regularization. We reveal how one can mitigate the effect of degradation in speech synthesis diversity in sequential training of new speakers using these methods. Finally, we present a simple extension to experience replay to improve the results in extreme setups where we have access to very small buffers.
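
The two techniques named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the names `ReplayBuffer`, `mixed_batches`, and `l2_anchor_penalty` are hypothetical, reservoir sampling is an assumed way of filling the buffer, and the plain L2 penalty toward the previous weights is only a simple stand-in for whichever weight regularization the authors use. Utterances are represented as `(speaker_id, utterance_id)` tuples and parameters as flat lists of floats.

```python
import random


class ReplayBuffer:
    """Fixed-size store of (speaker_id, utterance_id) pairs from earlier speakers."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        # Reservoir sampling (an assumption): each item seen so far
        # ends up in the buffer with equal probability.
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = item

    def sample(self, k):
        # Draw up to k stored items without replacement.
        return self.rng.sample(self.items, min(k, len(self.items)))


def mixed_batches(new_data, buffer, batch_size, replay_per_batch):
    """Yield batches of new-speaker data topped up with replayed old-speaker data."""
    if not 0 <= replay_per_batch < batch_size:
        raise ValueError("replay_per_batch must be in [0, batch_size)")
    step = batch_size - replay_per_batch
    for start in range(0, len(new_data), step):
        yield new_data[start:start + step] + buffer.sample(replay_per_batch)


def l2_anchor_penalty(params, old_params, strength):
    """Penalize drift of the current weights from the weights learned so far."""
    return strength * sum((p - q) ** 2 for p, q in zip(params, old_params))


if __name__ == "__main__":
    buffer = ReplayBuffer(capacity=4)
    for i in range(10):
        buffer.add(("old_speaker", i))
    new_data = [("new_speaker", i) for i in range(6)]
    for batch in mixed_batches(new_data, buffer, batch_size=4, replay_per_batch=2):
        print(batch)
```

In a training loop under these assumptions, each mixed batch would drive one fine-tuning step, and the penalty would be added to the synthesis loss so that the weights stay close to those that served the earlier speakers.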

