SD · LG · AS · Apr 13, 2021

NoiseVC: Towards High Quality Zero-Shot Voice Conversion

arXiv:2104.06074v1 · 6 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of high-quality voice conversion for unseen speakers, though it is incremental as it builds on existing disentanglement methods.

The paper tackles the problem of zero-shot voice conversion, where source and target speakers are unseen during training, by proposing NoiseVC, which uses Vector Quantization and Contrastive Predictive Coding with noise augmentation to achieve strong disentanglement of linguistic content with only a small sacrifice in sound quality.

Voice conversion (VC) is the task of transferring a target speaker's voice onto source audio without losing linguistic content. It is especially challenging when the source and target speakers are unseen during training (zero-shot VC). Previous approaches require a pre-trained model or linguistic data to perform zero-shot conversion. Meanwhile, VC models with Vector Quantization (VQ) or Instance Normalization (IN) can disentangle content from audio and achieve successful conversions. However, disentanglement in these models relies heavily on tightly constrained bottleneck layers, so sound quality is drastically sacrificed. In this paper, we propose NoiseVC, an approach that disentangles content based on VQ and Contrastive Predictive Coding (CPC). Additionally, Noise Augmentation is performed to further enhance the disentanglement capability. We conduct several experiments and demonstrate that NoiseVC has a strong disentanglement ability with a small sacrifice in quality.
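To make two of the building blocks named in the abstract more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The abstract does not say what kind of noise is used or where it is applied, so white Gaussian noise added to the waveform at a fixed signal-to-noise ratio is an assumption made only for illustration. The names `quantize`, `add_noise`, and `snr_db` are hypothetical, and the CPC loss is not shown.

```python
import math
import random


def quantize(frames, codebook):
    """Vector quantization: map each frame to the index of its nearest
    codebook entry under squared Euclidean distance."""
    indices = []
    for frame in frames:
        dists = [
            sum((f - c) ** 2 for f, c in zip(frame, code))
            for code in codebook
        ]
        indices.append(dists.index(min(dists)))
    return indices


def add_noise(signal, snr_db, rng):
    """Noise augmentation (assumed form): add white Gaussian noise whose
    expected power gives the requested signal-to-noise ratio in dB."""
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


# Toy data: three 2-D "frames" and a three-entry codebook.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7]]
codes = quantize(frames, codebook)

# A constant toy "waveform" with 20 dB of added noise.
clean = [1.0] * 1000
noisy = add_noise(clean, snr_db=20.0, rng=random.Random(0))
```

The discrete indices returned by `quantize` act as the bottleneck: fine detail that distinguishes frames mapped to the same entry is discarded, which is the trade-off between disentanglement and sound quality that the abstract describes.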

