SD, CL, AS | Oct 25, 2022

Adapitch: Adaption Multi-Speaker Text-to-Speech Conditioned on Pitch Disentangling with Untranscribed Data

arXiv:2210.13803v1 | 1 citation | h-index: 22
Originality: Incremental advance
AI Analysis

This paper addresses the problem of adapting TTS systems to new speakers without transcribed data. The advance is incremental because it builds on existing disentanglement and self-supervised learning approaches.

The paper tackles multi-speaker text-to-speech adaptation with untranscribed data. It proposes Adapitch, which combines self-supervised modules that learn text and mel representations with a supervised TTS module conditioned on disentangled pitch, text, and speaker information. In the reported experiments, the method achieved much better quality than the baseline methods.

In this paper, we propose Adapitch, a multi-speaker TTS method that adapts the supervised module using untranscribed data. We design two self-supervised modules that train the text encoder and the mel decoder separately with untranscribed data to enhance the text and mel representations. To better handle the prosody information in a synthesized voice, we design a supervised TTS module conditioned on the disentangling of pitch, text, and speaker content. Training is separated into two phases: first, the text encoder and mel decoder are pretrained in unsupervised mode and then fixed; second, the TTS module is trained in supervised mode on the disentangled representations. Experimental results show that Adapitch achieved much better quality than the baseline methods.
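
The two-phase schedule described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the `Module` class, the `trainable` flag, the step counts, and the function names are all hypothetical stand-ins, and a counter replaces real gradient updates. It only shows the ordering: the text encoder and mel decoder are pretrained separately, then fixed while the supervised module trains.

```python
from dataclasses import dataclass


@dataclass
class Module:
    """Hypothetical stand-in for a network; counts parameter updates."""
    name: str
    trainable: bool = True
    updates: int = 0

    def step(self) -> None:
        # A frozen (fixed) module ignores update requests.
        if self.trainable:
            self.updates += 1


def run_phase(modules, steps):
    """Request `steps` updates on every module in the list."""
    for _ in range(steps):
        for module in modules:
            module.step()


def two_phase_training(pretrain_steps, supervised_steps):
    text_encoder = Module("text_encoder")
    mel_decoder = Module("mel_decoder")
    tts_module = Module("supervised_tts")

    # Phase 1: pretrain the text encoder and mel decoder separately.
    run_phase([text_encoder], pretrain_steps)
    run_phase([mel_decoder], pretrain_steps)

    # Phase 2: fix the pretrained parts, then train the supervised module.
    text_encoder.trainable = False
    mel_decoder.trainable = False
    run_phase([text_encoder, tts_module, mel_decoder], supervised_steps)

    return text_encoder, mel_decoder, tts_module
```

Calling `two_phase_training(3, 5)` leaves the encoder and decoder with only their phase-1 updates, since the phase-2 requests reach them while they are fixed.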

