SDAIAS | May 20, 2025

ClapFM-EVC: High-Fidelity and Flexible Emotional Voice Conversion with Dual Control from Natural Language and Speech

arXiv:2505.13805v1 | 6 citations | h-index: 14 | INTERSPEECH
Originality: Incremental advance
AI Analysis

This work addresses interpretable control in emotional voice conversion, with applications such as speech synthesis. It appears incremental because it builds on existing methods.

The paper tackles the challenge of achieving high-fidelity and flexible emotional voice conversion by introducing ClapFM-EVC, a framework that generates converted speech using natural language prompts or reference speech with adjustable emotion intensity, validated through subjective and objective evaluations.

Despite great advances, achieving high-fidelity emotional voice conversion (EVC) with flexible and interpretable control remains challenging. This paper introduces ClapFM-EVC, a novel EVC framework capable of generating high-quality converted speech driven by natural language prompts or reference speech with adjustable emotion intensity. We first propose EVC-CLAP, an emotional contrastive language-audio pre-training model, guided by natural language prompts and categorical labels, to extract and align fine-grained emotional elements across speech and text modalities. Then, a FuEncoder with an adaptive intensity gate is presented to seamlessly fuse emotional features with Phonetic PosteriorGrams from a pre-trained ASR model. To further improve emotion expressiveness and speech naturalness, we propose a flow matching model conditioned on these captured features to reconstruct the Mel-spectrogram of the source speech. Subjective and objective evaluations validate the effectiveness of ClapFM-EVC.
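To make the contrastive language-audio alignment idea concrete, here is a minimal sketch of a generic CLAP-style symmetric contrastive loss in NumPy. It is not the paper's EVC-CLAP objective: the abstract does not specify the loss, and the guidance from categorical labels is not modeled here. The function names, the temperature value of 0.07, and the toy embeddings are all assumptions made for illustration.

```python
import numpy as np


def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Generic CLAP-style loss (illustrative assumption, not the paper's).

    Row i of audio_emb and row i of text_emb are treated as a matched
    speech/prompt pair; every other row in the batch is a negative.
    """
    a = l2_normalize(np.asarray(audio_emb, dtype=float))
    t = l2_normalize(np.asarray(text_emb, dtype=float))
    logits = a @ t.T / temperature  # (batch, batch) similarity matrix
    idx = np.arange(len(a))
    # Audio-to-text: each audio row should pick out its own text column.
    loss_a2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-audio: each text column should pick out its own audio row.
    loss_t2a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a2t + loss_t2a)


# Toy check: perfectly matched pairs versus deliberately mismatched pairs.
emb = np.eye(4)
aligned = symmetric_contrastive_loss(emb, emb)
mismatched = symmetric_contrastive_loss(emb, np.roll(emb, 1, axis=0))
```

In this toy setup the matched pairs yield a much smaller loss than the mismatched ones, which is the behavior that pulls emotional speech embeddings toward the embeddings of their describing prompts during pre-training.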

