SD LG AS | Apr 13, 2021

NoiseVC: Towards High Quality Zero-Shot Voice Conversion

arXiv:2104.06074v1 | 11.76 citations

Originality: Incremental advance

AI Analysis

This work addresses the challenge of high-quality voice conversion for unseen speakers, though it is incremental as it builds on existing disentanglement methods.

The paper tackles the problem of zero-shot voice conversion, where source and target speakers are unseen during training, by proposing NoiseVC, which uses Vector Quantization and Contrastive Predictive Coding with noise augmentation to achieve strong disentanglement of linguistic content with only a small sacrifice in sound quality.

Voice conversion (VC) is the task of transferring the voice of a target audio onto a source utterance without losing linguistic content. It is especially challenging when the source and target speakers are unseen during training (zero-shot VC). Previous approaches require a pre-trained model or linguistic data to perform zero-shot conversion. Meanwhile, VC models with Vector Quantization (VQ) or Instance Normalization (IN) are able to disentangle content from audio and achieve successful conversions. However, disentanglement in these models relies heavily on tightly constrained bottleneck layers, so sound quality is drastically sacrificed. In this paper, we propose NoiseVC, an approach that disentangles content based on VQ and Contrastive Predictive Coding (CPC). Additionally, noise augmentation is performed to further enhance disentanglement capability. We conduct several experiments and demonstrate that NoiseVC has strong disentanglement ability with only a small sacrifice in quality.
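To make the two building blocks named in the abstract more concrete, here is a minimal sketch of a VQ bottleneck (nearest-codebook lookup) combined with a simple noise augmentation step. It is not the paper's implementation: the abstract does not say how or where noise is applied, so additive Gaussian noise on feature frames is an assumption, and the names `quantize`, `add_noise`, the toy `codebook`, and the value of `sigma` are all hypothetical.

```python
import random


def quantize(frame, codebook):
    """Return (index, code) of the codebook vector nearest to `frame`.

    Nearest is measured by squared Euclidean distance, the usual
    choice for a VQ bottleneck.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: sq_dist(frame, codebook[i]))
    return index, codebook[index]


def add_noise(frames, sigma, rng):
    """Add zero-mean Gaussian noise to every value of every frame.

    Assumption: this is only one plausible form of noise augmentation,
    chosen for illustration.
    """
    return [[x + rng.gauss(0.0, sigma) for x in frame] for frame in frames]


# Toy 2-D codebook and feature frames (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [-0.8, 0.7]]

rng = random.Random(0)
noisy = add_noise(frames, sigma=0.05, rng=rng)

# Discrete code indices for clean and noise-augmented frames.
clean_ids = [quantize(f, codebook)[0] for f in frames]
noisy_ids = [quantize(f, codebook)[0] for f in noisy]
```

In this toy setup the small perturbation leaves the selected codes unchanged, which illustrates the intuition that a discrete bottleneck discards fine detail while keeping coarse content. How NoiseVC trains the codebook and combines it with CPC is not covered by this sketch.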
