SD · CL · AS · Mar 18, 2022

AdaVocoder: Adaptive Vocoder for Custom Voice

arXiv:2203.09825v3 · 5 citations · h-index: 46
Originality: Incremental advance
AI Analysis

This work addresses the challenge of creating personalized speech synthesis systems for users with minimal data, though it is incremental as it builds on existing GAN-based vocoder methods.

The paper tackles the problem of building custom voice synthesis systems with limited target speaker data by proposing an adaptive vocoder that uses a cross-domain consistency loss to prevent overfitting in few-shot scenarios. The results show that combining this adaptive vocoder with an adaptive acoustic model enables high-quality custom voice synthesis, as demonstrated on datasets like AISHELL3, CSMSC, and VXI-children.

Custom voice aims to construct a personal speech synthesis system by adapting a source speech synthesis model to a target speaker using only a few recordings of that speaker. The usual solution is to combine an adaptive acoustic model with a robust vocoder. However, training a robust vocoder usually requires a multi-speaker dataset that covers various age groups and timbres, so that the trained vocoder can be used for unseen speakers. Collecting such a dataset is difficult, and its distribution always has some mismatch with the distribution of the target speaker dataset. This paper proposes an adaptive vocoder for custom voice, approaching these problems from a different perspective. The adaptive vocoder mainly uses a cross-domain consistency loss to address the overfitting that GAN-based neural vocoders encounter during transfer learning in few-shot scenarios. We construct two adaptive vocoders, AdaMelGAN and AdaHiFi-GAN. First, we pre-train the source vocoder model on the AISHELL3 and CSMSC datasets, respectively. Then, we fine-tune it on the internal dataset VXI-children with few adaptation data. The empirical results show that a high-quality custom voice system can be built by combining an adaptive acoustic model with an adaptive vocoder.
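The abstract does not spell out the form of the cross-domain consistency loss, so the following is only a minimal sketch of one plausible formulation, not the authors' implementation. It assumes the loss compares the pairwise similarity structure of a batch under the frozen source vocoder with that of the same batch under the fine-tuned vocoder, and penalises divergence between the two. The function names, the use of cosine similarity, the KL direction, and the choice of NumPy are all assumptions made for illustration.

```python
import numpy as np


def pairwise_similarity_distribution(feats):
    """Row i: softmax over cosine similarities between sample i and every other sample."""
    flat = feats.reshape(feats.shape[0], -1)
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    unit = flat / np.maximum(norms, 1e-8)
    sim = unit @ unit.T  # (N, N) cosine similarities
    n = sim.shape[0]
    # Drop the diagonal (self-similarity) so each row has N - 1 entries.
    off_diag = sim[~np.eye(n, dtype=bool)].reshape(n, n - 1)
    # Numerically stable softmax along each row.
    off_diag = off_diag - off_diag.max(axis=1, keepdims=True)
    exp = np.exp(off_diag)
    return exp / exp.sum(axis=1, keepdims=True)


def cross_domain_consistency_loss(source_feats, adapted_feats):
    """Mean KL divergence between source and adapted similarity distributions.

    Assumption: KL(source || adapted); the paper may use the opposite direction
    or a different divergence altogether.
    """
    p = pairwise_similarity_distribution(source_feats)
    q = pairwise_similarity_distribution(adapted_feats)
    eps = 1e-12
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)
    return float(kl.mean())


# Hypothetical usage: features (or waveforms) generated from the same batch of
# mel-spectrograms by the frozen source vocoder and by the vocoder being fine-tuned.
rng = np.random.default_rng(0)
source_out = rng.normal(size=(4, 16))
adapted_out = rng.normal(size=(4, 16))
loss_value = cross_domain_consistency_loss(source_out, adapted_out)
```

In a setup like this, the term would be added to the usual GAN losses during fine-tuning, so that the adapted vocoder keeps the relative structure learned from the source data while fitting the few target recordings. The weighting of the term is not given in the abstract and is left open here.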

