Poly-SVC: Polyphony-Aware Singing Voice Conversion with Harmonic Modeling
For SVC researchers, Poly-SVC addresses the underexplored problem of processing residual harmonies in polyphonic recordings, enabling more realistic conversions.
Poly-SVC introduces a zero-shot, cross-lingual singing voice conversion system that handles residual harmonies from accompanied recordings, outperforming baselines in naturalness, timbre similarity, and harmony reconstruction.
Singing Voice Conversion (SVC) aims to transform a source singing voice into that of a target singer while preserving lyrics and melody. Most existing SVC methods depend on F0 extractors to capture the lead melody from clean vocals. However, no existing method can reliably extract clean vocals from accompanied recordings without leaving residual harmonies behind. In this paper, we propose Poly-SVC, a zero-shot, cross-lingual singing voice conversion system designed to process residual harmonies. Poly-SVC is composed of three key components: a Constant-Q Transform (CQT)-based pitch extractor that preserves both the lead melody and the residual harmony; a random sampler that reduces interfering information in the CQT; and a diffusion decoder based on Conditional Flow Matching (CFM) that fuses pitch, content, and timbre features into natural-sounding polyphonic outputs. Experiments demonstrate that Poly-SVC surpasses the baseline models in naturalness, timbre similarity, and harmony reconstruction on both harmony-rich and single-melody recordings.
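To make the CFM-based decoder more concrete, the following is a minimal sketch of how a training example is typically built under the standard optimal-transport CFM formulation. It is an illustration, not the Poly-SVC implementation: the abstract does not give the exact probability path, so the linear path, the `sigma_min` parameter, the function names, and the (batch, mel bins, frames) feature shape are all assumptions, and the conditioning on pitch, content, and timbre is omitted.

```python
import numpy as np

def cfm_training_pair(x1, rng, sigma_min=0.0):
    """Build one CFM training example from target features x1.

    Assumed (hypothetical) layout: x1 has shape (batch, mel_bins, frames).
    Returns the sampled time t, the interpolated point x_t, and the
    target velocity u_t that a decoder would be trained to predict.
    """
    x0 = rng.standard_normal(x1.shape)  # Gaussian noise sample
    # One time value per batch item, broadcastable over feature axes.
    t = rng.uniform(size=(x1.shape[0],) + (1,) * (x1.ndim - 1))
    # Linear path from noise (t = 0) towards data (t = 1).
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    # Velocity of that path; it does not depend on t.
    u_t = x1 - (1.0 - sigma_min) * x0
    return t, x_t, u_t

def cfm_loss(v_pred, u_t):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((v_pred - u_t) ** 2))
```

In a real system, a neural network would take `x_t`, `t`, and the conditioning features as input and produce `v_pred`. At inference, integrating the predicted velocity from noise at t = 0 to t = 1 yields the output features.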