AS · LG · SD · Aug 28, 2024

Multi-modal Adversarial Training for Zero-Shot Voice Cloning

arXiv:2408.15916v1 · 4 citations · h-index: 10
Originality: Incremental advance
AI Analysis

This work addresses the challenge of generating natural-sounding, varied speech in zero-shot voice cloning, which matters for applications such as personalized voice assistants. The contribution is incremental because it builds on existing GAN-based methods.

The paper tackles zero-shot voice cloning in text-to-speech models, which often produce average-sounding speech that lacks natural variation. It proposes a multi-modal adversarial training technique built around a Transformer encoder-decoder discriminator, and it reports improvements in speech quality and speaker similarity over the baseline.

A text-to-speech (TTS) model trained to reconstruct speech given text tends towards predictions that are close to the average characteristics of a dataset, failing to model the variations that make human speech sound natural. This problem is magnified for zero-shot voice cloning, a task that requires training data with high variance in speaking styles. We build on recent works that have used Generative Adversarial Networks (GANs) by proposing a Transformer encoder-decoder architecture that conditionally discriminates between real and generated speech features. The discriminator is used in a training pipeline that improves both the acoustic and prosodic features of a TTS model. We introduce our novel adversarial training technique by applying it to a FastSpeech2 acoustic model and training on Libriheavy, a large multi-speaker dataset, for the task of zero-shot voice cloning. Our model achieves improvements over the baseline in terms of speech quality and speaker similarity. Audio examples from our system are available online.
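
The abstract describes adding a discriminator to a reconstruction-trained TTS model, but it does not state the loss functions. As a rough illustration only, the sketch below shows how a reconstruction term and an adversarial term might be combined. The least-squares GAN objective, the L1 reconstruction loss, the `adv_weight` parameter, and all function names are assumptions made for this example, not details taken from the paper. The discriminator scores are passed in as plain floats; in the paper they would come from the conditional Transformer encoder-decoder discriminator.

```python
from typing import Sequence


def l1_reconstruction_loss(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between predicted and ground-truth features."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


def discriminator_loss(real_scores: Sequence[float], fake_scores: Sequence[float]) -> float:
    """Least-squares GAN loss (assumed): push real scores to 1, generated scores to 0."""
    real_term = sum((s - 1.0) ** 2 for s in real_scores) / len(real_scores)
    fake_term = sum(s ** 2 for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


def generator_loss(
    predicted: Sequence[float],
    target: Sequence[float],
    fake_scores: Sequence[float],
    adv_weight: float = 1.0,  # hypothetical weighting, not from the paper
) -> float:
    """Reconstruction loss plus an adversarial term that rewards fooling the discriminator."""
    recon = l1_reconstruction_loss(predicted, target)
    adv = sum((s - 1.0) ** 2 for s in fake_scores) / len(fake_scores)
    return recon + adv_weight * adv
```

The reconstruction term alone is minimized by predictions near the dataset average, which is the failure mode the abstract points to. The adversarial term adds pressure towards outputs that the discriminator cannot tell apart from real speech features.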

