SD · LG · AS · Nov 15, 2024

Zero-shot Voice Conversion with Diffusion Transformers

arXiv:2411.09943v1 · 60 citations · h-index: 1
Originality: Incremental advance
AI Analysis

This work addresses core challenges in zero-shot voice conversion for speech and singing applications, offering improved accuracy and versatility, though it appears incremental as it builds on existing diffusion and transformer methods.

The paper addresses zero-shot voice conversion, where traditional methods suffer from timbre leakage, insufficient timbre representation, and mismatches between training and inference. It proposes Seed-VC, a framework that applies an external timbre shifter to the source speech during training and uses a diffusion transformer to capture fine-grained timbre features from the full reference utterance. The authors report higher speaker similarity and lower word error rates than baselines such as OpenVoice and CosyVoice.

Zero-shot voice conversion aims to transform a source speech utterance to match the timbre of a reference utterance from an unseen speaker. Traditional approaches struggle with timbre leakage, insufficient timbre representation, and mismatches between training and inference tasks. We propose Seed-VC, a novel framework that addresses these issues by introducing an external timbre shifter during training to perturb the source speech timbre, mitigating leakage and aligning training with inference. Additionally, we employ a diffusion transformer that leverages the entire reference speech context, capturing fine-grained timbre features through in-context learning. Experiments demonstrate that Seed-VC outperforms strong baselines like OpenVoice and CosyVoice, achieving higher speaker similarity and lower word error rates in zero-shot voice conversion tasks. We further extend our approach to zero-shot singing voice conversion by incorporating fundamental frequency (F0) conditioning, resulting in performance comparable to current state-of-the-art methods. Our findings highlight the effectiveness of Seed-VC in overcoming core challenges, paving the way for more accurate and versatile voice conversion systems.
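
The training setup described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `build_training_example`, `toy_timbre_shifter`, and `toy_content_encoder` are hypothetical, the frames are toy lists of floats rather than real acoustic features, and the random prefix/suffix split is only an assumed way to model in-context timbre prompting. The sketch shows the two ideas the abstract names: content features are extracted from a timbre-perturbed copy of the utterance, while clean frames from the same utterance serve as the timbre context and the reconstruction target.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence

Frame = List[float]  # toy stand-in for one acoustic frame (e.g. a mel column)


@dataclass
class TrainingExample:
    content: List[Frame]  # content features from the timbre-shifted utterance
    prompt: List[Frame]   # clean frames given to the model as timbre context
    target: List[Frame]   # clean frames the model is asked to reconstruct
    prompt_len: int


def toy_timbre_shifter(frames: Sequence[Frame], factor: float = 1.3) -> List[Frame]:
    """Hypothetical perturbation: rescale every frame (stands in for a real shifter)."""
    return [[x * factor for x in f] for f in frames]


def toy_content_encoder(frames: Sequence[Frame]) -> List[Frame]:
    """Hypothetical encoder: index of the largest bin, unchanged by positive rescaling."""
    return [[float(max(range(len(f)), key=f.__getitem__))] for f in frames]


def build_training_example(
    frames: Sequence[Frame],
    timbre_shifter: Callable[[Sequence[Frame]], List[Frame]],
    content_encoder: Callable[[Sequence[Frame]], List[Frame]],
    rng: random.Random,
) -> TrainingExample:
    if len(frames) < 2:
        raise ValueError("need at least two frames to split prompt and target")
    # Content comes from the perturbed copy, so the original timbre is not
    # available through the content pathway (the leakage the abstract mentions).
    content = content_encoder(timbre_shifter(frames))
    # Assumed split: a random clean prefix acts as the reference context,
    # and the remaining clean frames are the reconstruction target.
    prompt_len = rng.randint(1, len(frames) - 1)
    return TrainingExample(
        content=content,
        prompt=[list(f) for f in frames[:prompt_len]],
        target=[list(f) for f in frames[prompt_len:]],
        prompt_len=prompt_len,
    )


utterance = [[0.1, 0.9, 0.2], [0.7, 0.1, 0.3], [0.2, 0.3, 0.8], [0.5, 0.6, 0.1]]
example = build_training_example(
    utterance, toy_timbre_shifter, toy_content_encoder, random.Random(0)
)
```

At inference time the same interface would take content features from the source utterance and the prompt from a different, unseen speaker. That pairing is the train/inference alignment the abstract refers to.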

Code Implementations: 1 repo
