ASAI | Jun 12, 2025

RT-VC: Real-Time Zero-Shot Voice Conversion with Speech Articulatory Coding

arXiv:2506.10289v1 | 3 citations | h-index: 25 | ACL
Originality: Incremental advance
AI Analysis

This work addresses the need for low-latency voice conversion in applications such as assistive communication and entertainment. It represents an incremental improvement in efficiency.

The paper tackles real-time zero-shot voice conversion by introducing RT-VC, which uses articulatory features and differentiable digital signal processing (DDSP) to reach a CPU latency of 61.4 ms, a 13.3% reduction relative to the current state-of-the-art method, while maintaining comparable synthesis quality.

Voice conversion has emerged as a pivotal technology in numerous applications ranging from assistive communication to entertainment. In this paper, we present RT-VC, a zero-shot real-time voice conversion system that delivers ultra-low latency and high-quality performance. Our approach leverages an articulatory feature space to naturally disentangle content and speaker characteristics, facilitating more robust and interpretable voice transformations. Additionally, the integration of differentiable digital signal processing (DDSP) enables efficient vocoding directly from articulatory features, significantly reducing conversion latency. Experimental evaluations demonstrate that, while maintaining synthesis quality comparable to the current state-of-the-art (SOTA) method, RT-VC achieves a CPU latency of 61.4 ms, representing a 13.3% reduction in latency.
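
To put the two latency figures in context, the short sketch below back-computes the baseline latency that a 13.3% reduction to 61.4 ms would imply. The abstract does not state the baseline figure, so the result is an inference rather than a reported number. It assumes the reduction is measured relative to the SOTA method's CPU latency under the same conditions. The function name `implied_baseline_ms` is hypothetical and used only for illustration.

```python
def implied_baseline_ms(latency_ms: float, reduction: float) -> float:
    """Return the baseline latency implied by a relative reduction.

    Assumes: latency_ms = baseline * (1 - reduction), with the reduction
    expressed as a fraction of the baseline (an assumption, not stated
    explicitly in the abstract).
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return latency_ms / (1.0 - reduction)


rtvc_ms = 61.4      # CPU latency reported in the abstract
reduction = 0.133   # 13.3% reduction reported in the abstract

baseline_ms = implied_baseline_ms(rtvc_ms, reduction)
saved_ms = baseline_ms - rtvc_ms

print(f"implied baseline latency: {baseline_ms:.1f} ms")
print(f"implied absolute saving:  {saved_ms:.1f} ms")
```

Under that assumption, the reported figures correspond to a baseline of roughly 71 ms, that is, an absolute saving of a little under 10 ms per conversion.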
