SDAIAS
Jun 4, 2025

Towards Better Disentanglement in Non-Autoregressive Zero-Shot Expressive Voice Conversion

arXiv:2506.04013v1 | 1 citation | h-index: 6 | INTERSPEECH
Originality: Incremental advance
AI Analysis

This work addresses style transfer quality in voice conversion systems, representing an incremental improvement over existing non-autoregressive frameworks.

The paper tackles the problem of source timbre leakage and poor linguistic-acoustic disentanglement in expressive voice conversion, achieving superior emotion and speaker similarity compared to baselines.

Expressive voice conversion aims to transfer both speaker identity and expressive attributes from a target speech to a given source speech. In this work, we improve over a self-supervised, non-autoregressive framework with a conditional variational autoencoder, focusing on reducing source timbre leakage and improving linguistic-acoustic disentanglement for better style transfer. To minimize style leakage, we use multilingual discrete speech units for content representation and reinforce embeddings with augmentation-based similarity loss and mix-style layer normalization. To enhance expressivity transfer, we incorporate local F0 information via cross-attention and extract style embeddings enriched with global pitch and energy features. Experiments show our model outperforms baselines in emotion and speaker similarity, demonstrating superior style adaptation and reduced source style leakage.
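
The abstract names mix-style layer normalization without giving its form. The sketch below is one common reading of the idea, not the authors' implementation: each utterance's style-conditioned scale and bias are interpolated with those of a randomly chosen utterance from the same batch, so the normalized content features cannot rely on a single consistent style. The function names, the Beta-distributed mixing weight, the `alpha` value, and the one-vector-per-utterance simplification are all assumptions made for illustration. In a real model the scales and biases would come from learned projections of a style embedding.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def mix_style_layer_norm(frames, scales, biases, alpha=0.2, rng=None):
    """Hypothetical mix-style layer norm over a batch.

    frames: one feature vector per utterance (simplified to a single frame).
    scales, biases: per-utterance style-derived scale and bias vectors
                    (assumed to be given; normally predicted from a style embedding).
    alpha: assumed Beta(alpha, alpha) concentration for the mixing weight.
    """
    rng = rng or random.Random()
    batch = len(frames)

    # Pair every utterance with a randomly chosen partner from the batch.
    perm = list(range(batch))
    rng.shuffle(perm)

    # One mixing weight for the whole batch, drawn from Beta(alpha, alpha).
    lam = rng.betavariate(alpha, alpha)

    out = []
    for i in range(batch):
        j = perm[i]
        # Interpolate this utterance's style parameters with its partner's.
        gamma = [lam * a + (1 - lam) * b for a, b in zip(scales[i], scales[j])]
        beta = [lam * a + (1 - lam) * b for a, b in zip(biases[i], biases[j])]
        normed = layer_norm(frames[i])
        out.append([g * n + b for g, n, b in zip(gamma, normed, beta)])
    return out
```

When every utterance in the batch carries the same scale and bias, the mixing has no effect and the function reduces to a plain conditional layer norm, which is a quick way to sanity-check the sketch.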

