Zero-shot Voice Conversion via Self-supervised Prosody Representation Learning
This work addresses speaker voice mismatch in zero-shot voice conversion, with applications such as voice customization and animation production, and represents an incremental improvement over existing disentanglement methods.
The paper tackles prosody leakage in zero-shot voice conversion, which causes the voice in synthesized speech to deviate from that of the target speaker. It proposes a self-supervised method that learns disentangled pitch and volume representations, and it reports performance that surpasses state-of-the-art methods.
Voice Conversion (VC) for unseen speakers, also known as zero-shot VC, is an attractive research topic because it enables a range of applications such as voice customization and animation production. Recent work in this area has made progress with disentanglement methods that separate utterance content and speaker characteristics from speech audio recordings. However, many of these methods suffer from leakage of prosody (e.g., pitch, volume), which causes the speaker voice in the synthesized speech to differ from that of the desired target speaker. To address this issue, we propose a novel self-supervised approach that learns disentangled pitch and volume representations capable of capturing the prosody styles of different speakers. We then use the learned prosodic representations as conditional information to train and enhance our VC model for zero-shot conversion. In our experiments, we show that our prosody representations are disentangled and rich in prosody information. Moreover, we demonstrate that adding our prosody representations improves our VC performance and surpasses state-of-the-art zero-shot VC results.
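To make the two prosody attributes named above concrete, the following is a minimal sketch of how frame-level pitch and volume can be measured directly from a waveform. It is not the self-supervised method proposed here, which learns its representations rather than computing them by hand. The function names (`frame_signal`, `rms_volume`, `autocorr_pitch`, `prosody_features`), the frame and hop sizes, and the 60 to 400 Hz search range are all assumptions chosen for illustration.

```python
import math


def frame_signal(samples, frame_len, hop):
    """Split a list of samples into overlapping frames (hypothetical helper)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def rms_volume(frame):
    """Root-mean-square energy of one frame, a simple proxy for volume."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def autocorr_pitch(frame, sample_rate, fmin=60.0, fmax=400.0):
    """Estimate F0 in Hz from the autocorrelation peak within [fmin, fmax].

    Returns 0.0 when no positive peak is found (e.g. for silence).
    The search range is an assumed default, not taken from the paper.
    """
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[i] * frame[i + lag]
                    for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0


def prosody_features(samples, sample_rate, frame_len=1024, hop=512):
    """Return one (pitch_hz, volume) pair per frame."""
    return [(autocorr_pitch(f, sample_rate), rms_volume(f))
            for f in frame_signal(samples, frame_len, hop)]


if __name__ == "__main__":
    sr = 16000
    # Synthetic 220 Hz tone standing in for a voiced speech segment.
    tone = [0.5 * math.sin(2 * math.pi * 220 * n / sr) for n in range(2048)]
    for pitch_hz, volume in prosody_features(tone, sr):
        print(f"pitch={pitch_hz:.1f} Hz  volume={volume:.3f}")
```

Hand-crafted contours like these vary with both the speaker and the spoken content, which is one way to see why a learned, disentangled representation of pitch and volume is attractive as conditioning information for a VC model.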