ACE-VC: Adaptive and Controllable Voice Conversion using Explicitly Disentangled Self-supervised Speech Representations
This work addresses voice conversion for applications such as speech synthesis and personalization. It offers adaptive and controllable synthesis, with incremental improvements in both the disentanglement of speech representations and the synthesis model.
The paper tackles zero-shot voice conversion by disentangling speech into linguistic content, speaker characteristics, and speaking style, using representations trained with self-supervised learning. It reports state-of-the-art results, with a speaker verification equal error rate (EER) of 5.5% for seen speakers and 8.4% for unseen speakers, using only 10 seconds of target-speaker data.
In this work, we propose a zero-shot voice conversion method using speech representations trained with self-supervised learning. First, we develop a multi-task model to decompose a speech utterance into features such as linguistic content, speaker characteristics, and speaking style. To disentangle content and speaker representations, we propose a training strategy based on Siamese networks that encourages similarity between the content representations of the original and pitch-shifted audio. Next, we develop a synthesis model with pitch and duration predictors that can effectively reconstruct the speech signal from its decomposed representation. Our framework allows controllable and speaker-adaptive synthesis and performs zero-shot any-to-any voice conversion, achieving state-of-the-art results on metrics evaluating speaker similarity, intelligibility, and naturalness. Using just 10 seconds of data for a target speaker, our framework can perform voice swapping and achieves a speaker verification EER of 5.5% for seen speakers and 8.4% for unseen speakers.
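The Siamese training idea can be illustrated with a minimal sketch. The abstract does not specify the loss, so the frame-wise cosine formulation below is an assumption, and the names `cosine_similarity` and `siamese_content_loss` are hypothetical. The sketch takes two sequences of content vectors, one from the original audio and one from its pitch-shifted copy, and returns a value that shrinks as the two sequences become more alike.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # eps guards against division by zero for all-zero vectors.
    return dot / max(norm_u * norm_v, eps)


def siamese_content_loss(content_orig, content_shifted):
    """Mean of (1 - cosine similarity) over time frames.

    Assumed formulation, not the paper's confirmed loss: each argument
    is a list of per-frame content vectors produced by the same encoder,
    one from the original audio and one from the pitch-shifted audio.
    """
    if len(content_orig) != len(content_shifted):
        raise ValueError("both inputs must have the same number of frames")
    if not content_orig:
        raise ValueError("inputs must contain at least one frame")
    total = sum(
        1.0 - cosine_similarity(u, v)
        for u, v in zip(content_orig, content_shifted)
    )
    return total / len(content_orig)
```

Under this assumed formulation the loss is 0 when the content vectors of both inputs point in the same direction and grows toward 2 as they diverge. Minimizing it would push the encoder to ignore pitch, which mostly carries speaker and style information, while keeping the linguistic content.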