ASLG · Feb 6, 2025

GenVC: Self-Supervised Zero-Shot Voice Conversion

arXiv:2502.04519v2 · 11 citations · h-index: 63
AI Analysis

This paper addresses zero-shot voice conversion for applications such as privacy protection and speaker cloning, proposing a new method for a known bottleneck: the reliance on externally supervised components such as speaker encoders.

The paper tackles zero-shot voice conversion by introducing GenVC, a self-supervised framework that disentangles speaker identity from linguistic content and achieves notably higher speaker similarity, with naturalness comparable to leading methods.

Most current zero-shot voice conversion methods rely on externally supervised components, particularly speaker encoders, for training. To explore alternatives that eliminate this dependency, this paper introduces GenVC, a novel framework that disentangles speaker identity and linguistic content from speech signals in a self-supervised manner. GenVC leverages speech tokenizers and an autoregressive, Transformer-based language model as its backbone for speech generation. This design supports large-scale training while enhancing both source speaker privacy protection and target speaker cloning fidelity. Experimental results demonstrate that GenVC achieves notably higher speaker similarity, with naturalness on par with leading zero-shot approaches. Moreover, due to its autoregressive formulation, GenVC introduces flexibility in temporal alignment, reducing the preservation of source prosody and speaker-specific traits, and making it highly effective for voice anonymization.
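
To make the point about temporal alignment concrete, here is a minimal toy sketch of autoregressive token decoding conditioned on a prefix of target-speaker tokens and source-content tokens. It is not GenVC's implementation: the function names, the token vocabulary size, the end-of-sequence id, and the random stand-in for the Transformer language model are all assumptions made for illustration. The only thing it shows is that the output length is decided step by step by the decoder rather than copied from the source.

```python
import random

EOS = 0            # hypothetical end-of-sequence token id
VOCAB_SIZE = 1024  # hypothetical acoustic-token vocabulary size


def toy_next_token(prefix, generated, rng):
    """Stand-in for one language-model decoding step (assumption).

    A real model would score every token given the full context. Here a
    seeded RNG is used so the sketch runs without any dependencies.
    """
    # The stop probability grows with the number of generated tokens,
    # so the decoder itself decides when the utterance ends.
    stop_prob = min(1.0, len(generated) / (2.0 * max(1, len(prefix))))
    if rng.random() < stop_prob:
        return EOS
    return rng.randint(1, VOCAB_SIZE - 1)


def convert(speaker_tokens, content_tokens, max_len=200, seed=0):
    """Generate acoustic tokens one at a time, conditioned on a prefix."""
    rng = random.Random(seed)
    # Conditioning prefix: target-speaker tokens followed by content tokens.
    prefix = list(speaker_tokens) + list(content_tokens)
    generated = []
    for _ in range(max_len):
        token = toy_next_token(prefix, generated, rng)
        if token == EOS:
            break
        generated.append(token)
    return generated


if __name__ == "__main__":
    out = convert(speaker_tokens=[5, 6, 7], content_tokens=[10, 11, 12, 13])
    print(len(out), out)
```

Because generation stops only when the decoder emits the end-of-sequence token, the number of output tokens is not tied frame by frame to the source, which is the property the abstract credits for reduced preservation of source prosody.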

