SD, CL, AS
Mar 18, 2022

Improve few-shot voice cloning using multi-modal learning

arXiv:2203.09708v1
12 citations
h-index: 9
Originality: Incremental advance
AI Analysis

This work addresses the understudied area of multi-modal few-shot voice cloning, which could enhance applications like personalized speech synthesis, though it appears incremental as it builds on existing models like Tacotron2.

The paper tackles the problem of few-shot voice cloning by proposing a multi-modal learning approach that extends Tacotron2 with an unsupervised speech representation module, achieving significant performance improvements over single-modal systems in both few-shot text-to-speech and voice conversion scenarios.

Few-shot voice cloning has improved significantly in recent years. However, most few-shot voice cloning models are single-modal, and multi-modal few-shot voice cloning remains understudied. In this paper, we propose using multi-modal learning to improve few-shot voice cloning performance. Inspired by recent work on unsupervised speech representation, we build the proposed multi-modal system by extending Tacotron2 with an unsupervised speech representation module. We evaluate the system in two few-shot voice cloning scenarios: few-shot text-to-speech (TTS) and voice conversion (VC). Experimental results show that the proposed multi-modal learning significantly improves few-shot voice cloning performance over the corresponding single-modal systems.

