SDAIAS | May 30, 2025

Discl-VC: Disentangled Discrete Tokens and In-Context Learning for Controllable Zero-Shot Voice Conversion

arXiv:2505.24291v1 | 6 citations | h-index: 19 | INTERSPEECH
Originality: Incremental advance
AI Analysis

This work addresses the need for more controllable voice conversion in applications such as speech synthesis. It appears incremental, since it builds on existing self-supervised and transformer methods.

The paper tackles limited controllability in zero-shot voice conversion: existing systems struggle to reproduce the source speaker's speaking style or to mimic the target speaker's. It proposes Discl-VC, a framework that disentangles content and prosody and synthesizes speech through in-context learning. The authors report superior zero-shot conversion performance and accurate prosody control.

Currently, zero-shot voice conversion systems are capable of synthesizing the voice of unseen speakers. However, most existing approaches struggle to accurately replicate the speaking style of the source speaker or mimic the distinctive speaking style of the target speaker, thereby limiting the controllability of voice conversion. In this work, we propose Discl-VC, a novel voice conversion framework that disentangles content and prosody information from self-supervised speech representations and synthesizes the target speaker's voice through in-context learning with a flow matching transformer. To enable precise control over the prosody of generated speech, we introduce a mask generative transformer that predicts discrete prosody tokens in a non-autoregressive manner based on prompts. Experimental results demonstrate the superior performance of Discl-VC in zero-shot voice conversion and its remarkable accuracy in prosody control for synthesized speech.
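The abstract says prosody tokens are predicted "in a non-autoregressive manner" by a mask generative transformer but gives no decoding details. As a rough illustration only, the sketch below shows one common way such models decode: iterative parallel unmasking with a cosine schedule, in the style of MaskGIT. This is an assumption about the general technique, not the authors' implementation. The names `MASK`, `toy_predictor`, and `masked_decode` are hypothetical, the predictor is a random stand-in for a trained transformer, and the prompt conditioning mentioned in the abstract is omitted.

```python
import math
import random

MASK = -1  # hypothetical sentinel id marking a still-masked position


def toy_predictor(tokens, vocab_size, rng):
    """Random stand-in for a trained transformer (assumption).

    Returns one (token_id, confidence) pair per position. A real model
    would condition on the unmasked tokens and on the prompt.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def masked_decode(length, vocab_size, steps=4, seed=0):
    """Fill a fully masked sequence over a fixed number of parallel steps."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        preds = toy_predictor(tokens, vocab_size, rng)
        # Cosine schedule: how many positions should stay masked after this step.
        keep_masked = int(length * math.cos(math.pi / 2 * step / steps))
        n_commit = max(1, len(masked) - keep_masked)
        # Commit the most confident predictions among the masked positions.
        ranked = sorted(masked, key=lambda i: preds[i][1], reverse=True)
        for i in ranked[:n_commit]:
            tokens[i] = preds[i][0]
    return tokens


if __name__ == "__main__":
    print(masked_decode(length=10, vocab_size=8))
```

Unlike autoregressive decoding, which needs one model call per token, this loop needs only `steps` calls regardless of sequence length. Each call commits several positions at once.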

