SDAIAS · Nov 11, 2025

HQ-SVC: Towards High-Quality Zero-Shot Singing Voice Conversion in Low-Resource Scenarios

arXiv:2511.08496v3 · 1 citation · h-index: 7
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of high-quality singing voice conversion for unseen speakers with limited data, though the advance appears incremental, building on existing approaches.

The paper tackles zero-shot singing voice conversion in low-resource scenarios, where existing methods degrade output quality and require high computational resources. It proposes HQ-SVC, which is reported to significantly outperform state-of-the-art methods in conversion quality and efficiency and to achieve superior voice naturalness compared to specialized audio super-resolution methods.

Zero-shot singing voice conversion (SVC) transforms a source singer's timbre to an unseen target speaker's voice while preserving melodic content without fine-tuning. Existing methods model speaker timbre and vocal content separately, losing essential acoustic information, which degrades output quality, while requiring significant computational resources. To overcome these limitations, we propose HQ-SVC, an efficient framework for high-quality zero-shot SVC. HQ-SVC first jointly extracts content and speaker features using a decoupled codec. It then enhances fidelity through pitch and volume modeling, preserving critical acoustic information typically lost in separate modeling approaches, and progressively refines outputs via differentiable signal processing and diffusion techniques. Evaluations confirm HQ-SVC significantly outperforms state-of-the-art zero-shot SVC methods in conversion quality and efficiency. Beyond voice conversion, HQ-SVC achieves superior voice naturalness compared to specialized audio super-resolution methods while natively supporting voice super-resolution tasks.
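
The abstract mentions pitch and volume modeling but does not say how those features are computed. As a rough illustration only, the sketch below extracts frame-level volume (RMS energy) and pitch (a plain autocorrelation peak search) with NumPy. The function names, frame sizes, and the autocorrelation estimator are assumptions made for this example; they are not taken from HQ-SVC, which may well use a different pitch extractor.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Slice a 1-D signal into overlapping frames (hypothetical helper)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def frame_volume(frames):
    """Per-frame volume as root-mean-square amplitude."""
    return np.sqrt(np.mean(frames ** 2, axis=1))

def frame_pitch(frames, sr, fmin=80.0, fmax=1000.0):
    """Per-frame pitch in Hz from the strongest autocorrelation lag.

    Assumption: a simple autocorrelation estimator stands in for whatever
    pitch extractor the paper actually uses. Silent frames are left at 0.
    """
    lag_min = int(sr / fmax)
    lag_max = int(sr / fmin)
    f0 = np.zeros(len(frames))
    for i, frame in enumerate(frames):
        frame = frame - frame.mean()
        # Keep only non-negative lags of the full autocorrelation.
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        if ac[0] <= 0:
            continue
        lag = lag_min + np.argmax(ac[lag_min:lag_max + 1])
        f0[i] = sr / lag
    return f0

# Usage on a synthetic 200 Hz tone (one second at 16 kHz).
sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 200.0 * t)
frames = frame_signal(tone, frame_len=1024, hop=256)
volume = frame_volume(frames)
pitch = frame_pitch(frames, sr)
```

In a conversion pipeline of the kind the abstract describes, curves like `pitch` and `volume` would be taken from the source singing and passed to the synthesis stages as conditioning, so that melody and dynamics carry over while the timbre changes.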

