AS · LG · SD · Jun 7, 2020

VQVC+: One-Shot Voice Conversion by Vector Quantization and U-Net architecture

arXiv:2006.04154v1 · 107 citations
Originality: Incremental advance
AI Analysis

This work addresses voice conversion for unseen speakers in a one-shot setting, but it is incremental as it builds on existing auto-encoder and VQ methods.

The paper tackles imperfect disentanglement in one-shot voice conversion, which harms output speech quality. It integrates a U-Net architecture into an auto-encoder-based system that uses vector quantization as the bottleneck, and its objective and subjective evaluations report improved audio naturalness and speaker similarity.

Voice conversion (VC) is a task that converts the source speaker's timbre, accent, and tone in audio into those of another speaker while preserving the linguistic content. It remains a challenging task, especially in a one-shot setting. Auto-encoder-based VC methods disentangle the speaker and the content in input speech without being given the speaker's identity, so these methods can generalize to unseen speakers. The disentanglement capability is achieved by vector quantization (VQ), adversarial training, or instance normalization (IN). However, imperfect disentanglement may harm the quality of the output speech. In this work, to further improve audio quality, we use the U-Net architecture within an auto-encoder-based VC system. We find that a strong information bottleneck is necessary to leverage the U-Net architecture. The VQ-based method, which quantizes the latent vectors, can serve this purpose. The objective and subjective evaluations show that the proposed method performs well in both audio naturalness and speaker similarity.
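
To make the idea of quantizing latent vectors concrete, the following is a minimal sketch of nearest-neighbor vector quantization in plain Python. It is not the paper's implementation: the function names, the tiny hand-written codebook, and the two-dimensional latents are all hypothetical and chosen only for illustration. It shows why quantization acts as a bottleneck: whatever the encoder emits, only one of a small, fixed set of codebook vectors passes through for each frame.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents, codebook):
    """Replace each latent vector with its nearest codebook entry.

    Returns the chosen codebook indices and the quantized vectors.
    Hypothetical helper for illustration, not the paper's code.
    """
    indices = [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]
    quantized = [list(codebook[k]) for k in indices]
    return indices, quantized


# Assumed toy values: 3 codebook entries, 4 latent frames of dimension 2.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
latents = [[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7], [0.4, 0.4]]

indices, quantized = quantize(latents, codebook)
print(indices)    # [0, 1, 2, 0]
print(quantized)  # [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0], [0.0, 0.0]]
```

In a trained system the codebook would be learned and the latents would come from an encoder; here both are fixed by hand so the lookup itself is easy to follow.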

Code Implementations: 1 repo
