AS | AI | Sep 21, 2025

MaskVCT: Masked Voice Codec Transformer for Zero-Shot Voice Conversion With Increased Controllability via Multiple Guidances

arXiv:2509.17143v1 | 1 citation | h-index: 47
Originality: Incremental advance
AI Analysis

This work addresses zero-shot voice conversion for users who need flexible control over the output. It is an incremental improvement over previous models.

The paper tackles zero-shot voice conversion by introducing MaskVCT, a model that integrates multiple classifier-free guidances for finer control over speaker identity, linguistic content, and prosody. It achieves the best target speaker and accent similarities, with competitive word and character error rates compared to baselines.

We introduce MaskVCT, a zero-shot voice conversion (VC) model that offers multi-factor controllability through multiple classifier-free guidances (CFGs). While previous VC models rely on a fixed conditioning scheme, MaskVCT integrates diverse conditions in a single model. To further enhance robustness and control, the model can leverage continuous or quantized linguistic features to improve intelligibility and speaker similarity, and can use or omit pitch contour to control prosody. These choices allow users to seamlessly balance speaker identity, linguistic content, and prosodic factors in a zero-shot VC setting. Extensive experiments demonstrate that MaskVCT achieves the best target speaker and accent similarities while obtaining competitive word and character error rates compared to existing baselines. Audio samples are available at https://maskvct.github.io/.
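To make the idea of multiple classifier-free guidances concrete, here is a minimal sketch of one common way to combine several guidance terms: start from the unconditional logits and add a separately weighted correction for each condition. This is an illustrative assumption, not the formulation used in the paper, which the abstract does not spell out. The function name `combine_guidances`, the condition names, the weights, and the toy logits are all hypothetical.

```python
from typing import Dict, List, Sequence


def combine_guidances(
    uncond: Sequence[float],
    conditional: Dict[str, Sequence[float]],
    weights: Dict[str, float],
) -> List[float]:
    """Blend unconditional and per-condition logits, one weight per condition.

    Assumed (generic) rule: guided = uncond + sum_k w_k * (cond_k - uncond).
    """
    guided = list(uncond)
    for name, cond_logits in conditional.items():
        w = weights.get(name, 0.0)  # a missing weight disables that guidance
        for i, (c, u) in enumerate(zip(cond_logits, uncond)):
            guided[i] += w * (c - u)
    return guided


# Toy logits over a 3-token codec vocabulary (made-up numbers).
uncond = [0.0, 1.0, 2.0]
conditional = {
    "speaker": [1.0, 1.0, 2.0],     # logits given only the speaker prompt
    "linguistic": [0.0, 3.0, 2.0],  # logits given only linguistic features
    "pitch": [0.0, 1.0, 1.0],       # logits given only the pitch contour
}
# Emphasize speaker identity, keep content guidance, ignore pitch.
weights = {"speaker": 2.0, "linguistic": 1.0, "pitch": 0.0}
guided = combine_guidances(uncond, conditional, weights)
```

In this sketch, raising one weight pushes the output toward that condition, and setting a weight to zero removes its influence. That mirrors, in a simplified way, the trade-off between speaker identity, linguistic content, and prosody described in the abstract.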

