AS · AI · LG · SD · Jan 3, 2024

CoMoSVC: Consistency Model-based Singing Voice Conversion

arXiv:2401.01792v1 · 17 citations · h-index: 23 · ISCSLP
Originality: Incremental advance
AI Analysis

This work addresses a bottleneck in SVC for audio generation applications, offering a practical speed improvement while preserving quality, though it is incremental as it builds on existing diffusion and consistency model frameworks.

The paper tackles the slow inference speed of diffusion-based singing voice conversion (SVC) methods by proposing CoMoSVC, a consistency model-based approach that achieves one-step sampling, resulting in significantly faster inference while maintaining comparable or superior performance to state-of-the-art systems.

Diffusion-based singing voice conversion (SVC) methods have achieved remarkable performance, producing natural audio with high similarity to the target timbre. However, the iterative sampling process results in slow inference, so acceleration becomes crucial. In this paper, we propose CoMoSVC, a consistency model-based SVC method that aims to achieve both high-quality generation and high-speed sampling. A diffusion-based teacher model is first designed specifically for SVC, and a student model is then distilled under self-consistency properties to achieve one-step sampling. Experiments on a single NVIDIA RTX 4090 GPU show that CoMoSVC has a significantly faster inference speed than the state-of-the-art (SOTA) diffusion-based SVC system while still achieving comparable or superior conversion performance on both subjective and objective metrics. Audio samples and code are available at https://comosvc.github.io/.
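
To make the one-step sampling idea concrete, here is a minimal sketch of a consistency function and a single-evaluation sampler. It uses the skip/output parameterization from the general consistency-model literature, which guarantees that the function is the identity at the smallest noise level. The constants `SIGMA_DATA`, `EPS`, and `T_MAX`, the `toy_network` stand-in, and the shape of the conditioning vector are all illustrative assumptions, not details taken from the CoMoSVC paper or its code.

```python
import math
import random

# Assumed constants, borrowed from common consistency-model setups;
# the values actually used by CoMoSVC are not stated in this summary.
SIGMA_DATA = 0.5   # assumed standard deviation of the data
EPS = 0.002        # assumed smallest noise level
T_MAX = 80.0       # assumed largest noise level


def c_skip(t):
    """Weight on the noisy input; equals 1 at t == EPS."""
    return SIGMA_DATA ** 2 / ((t - EPS) ** 2 + SIGMA_DATA ** 2)


def c_out(t):
    """Weight on the network output; equals 0 at t == EPS."""
    return SIGMA_DATA * (t - EPS) / math.sqrt(SIGMA_DATA ** 2 + t ** 2)


def toy_network(x, t, cond):
    """Hypothetical stand-in for a trained student network.

    A real model would predict a mel-spectrogram frame from the noisy
    input, the noise level, and content/pitch/speaker conditioning.
    """
    return [0.5 * c for c in cond]


def consistency_fn(x, t, cond, network=toy_network):
    """f(x, t) = c_skip(t) * x + c_out(t) * F(x, t, cond)."""
    out = network(x, t, cond)
    return [c_skip(t) * xi + c_out(t) * oi for xi, oi in zip(x, out)]


def one_step_sample(cond, rng):
    """Draw pure noise at T_MAX and map it to data in one evaluation."""
    x_T = [T_MAX * rng.gauss(0.0, 1.0) for _ in cond]
    return consistency_fn(x_T, T_MAX, cond)


if __name__ == "__main__":
    rng = random.Random(0)
    cond = [0.3, 0.4, 0.5]  # hypothetical conditioning features
    print(one_step_sample(cond, rng))
```

A diffusion teacher would instead call its denoiser many times along a noise schedule. The student replaces that loop with the single `consistency_fn` call, which is where the speedup described in the abstract would come from.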

Code Implementations: 1 repo
