MM CL SD AS
Jun 3, 2025

StarVC: A Unified Auto-Regressive Framework for Joint Text and Speech Generation in Voice Conversion

Fengjin Li, Jie Wang, Yadong Niu, Yongqing Wang, Meng Meng, Jian Luan, Zhiyong Wu

arXiv:2506.02414v12.31 citations
h-index: 4
INTERSPEECH

Originality: Incremental advance

AI Analysis

This work addresses voice conversion for applications requiring high-fidelity speaker and content preservation, representing an incremental improvement by integrating explicit text modeling into existing methods.

The paper proposes StarVC, a unified autoregressive voice conversion framework that predicts text tokens before synthesizing acoustic features. The authors report better preservation of linguistic content (measured by WER and CER) and of speaker characteristics (measured by SECS and MOS).

Voice Conversion (VC) modifies speech to match a target speaker while preserving linguistic content. Traditional methods usually extract speaker information directly from speech and make no explicit use of linguistic content. Since VC fundamentally involves disentangling speaker identity from linguistic content, leveraging structured semantic features could enhance conversion performance. However, previous attempts to incorporate semantic features into VC have shown limited effectiveness, motivating the integration of explicit text modeling. We propose StarVC, a unified autoregressive VC framework that predicts text tokens before synthesizing acoustic features. Experiments demonstrate that StarVC outperforms conventional VC methods in preserving both linguistic content (measured by WER and CER) and speaker characteristics (measured by SECS and MOS). Audio demos can be found at: https://thuhcsi.github.io/StarVC/.
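
For readers unfamiliar with the objective metrics named above, the following is a minimal sketch of their standard textbook definitions, not the paper's evaluation code. WER and CER are edit distances between a reference transcript and a recognized transcript, normalized by reference length; SECS is the cosine similarity between two speaker embeddings. The function names, the whitespace tokenization, and the plain-list embeddings are assumptions made for illustration; the abstract does not say which ASR system or speaker encoder produced the paper's numbers.

```python
from math import sqrt


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute/insert/delete cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate, ignoring spaces (an assumption; conventions vary)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


def secs(embedding_a, embedding_b):
    """Speaker-embedding cosine similarity between two equal-length vectors."""
    if len(embedding_a) != len(embedding_b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(embedding_a, embedding_b))
    norm_a = sqrt(sum(a * a for a in embedding_a))
    norm_b = sqrt(sum(b * b for b in embedding_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(wer("the cat sat on the mat", "the cat sat on a mat"))
    # One substituted character out of ten reference characters.
    print(cer("hello world", "hallo world"))
    # Toy 2-D vectors standing in for real speaker embeddings.
    print(secs([1.0, 0.0], [1.0, 0.0]))
```

Lower WER and CER indicate that the converted speech keeps more of the source content, while a higher SECS indicates that it sounds closer to the target speaker. MOS is a subjective listening score and has no comparable closed-form computation.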
