ASSD Oct 20, 2021

Disentanglement of Emotional Style and Speaker Identity for Expressive Voice Conversion

arXiv:2110.10326v2, 34 citations
Originality: Incremental advance
AI Analysis

This work addresses the problem of generating expressive synthetic voices for applications like speech synthesis and human-computer interaction, representing an incremental improvement in voice conversion techniques.

The paper tackles the challenge of disentangling emotional style from speaker identity in expressive voice conversion, proposing StyleVC to jointly convert both aspects for arbitrary speakers, with experiments showing effectiveness in objective and subjective evaluations.

Expressive voice conversion performs identity conversion for emotional speakers by jointly converting speaker identity and emotional style. Due to the hierarchical structure of speech emotion, it is challenging to disentangle the emotional style for different speakers. Inspired by the recent success of speaker disentanglement with variational autoencoders (VAEs), we propose an any-to-any expressive voice conversion framework called StyleVC. StyleVC is designed to disentangle linguistic content, speaker identity, pitch, and emotional style information. We study the use of a style encoder to model emotional style explicitly. At run-time, StyleVC converts both speaker identity and emotional style for arbitrary speakers. Experiments validate the effectiveness of our proposed framework in both objective and subjective evaluations.
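
The run-time behaviour described in the abstract can be illustrated with a toy data-flow sketch. This is not the StyleVC implementation: the `Factors` container, the `convert` function, and the choice to keep pitch from the source utterance are all assumptions made for illustration, and the real system uses learned encoders and a decoder rather than plain lists. The sketch only shows the idea of swapping the speaker and style factors while leaving the remaining factors untouched.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Factors:
    """Hypothetical container for the four factors named in the abstract."""
    content: List[float]  # linguistic content representation
    pitch: List[float]    # pitch contour
    speaker: List[float]  # speaker identity embedding
    style: List[float]    # emotional style embedding


def convert(source: Factors, reference: Factors) -> Factors:
    """Swap in the reference speaker and style; keep source content and pitch.

    Keeping pitch from the source is an assumption of this sketch,
    not something the abstract states.
    """
    return replace(source, speaker=reference.speaker, style=reference.style)


# Toy values standing in for encoder outputs.
src = Factors(content=[0.1, 0.2], pitch=[120.0, 118.0], speaker=[1.0], style=[0.0])
ref = Factors(content=[0.9, 0.8], pitch=[210.0, 205.0], speaker=[-1.0], style=[0.7])
converted = convert(src, ref)
```

In a full system, the converted factors would then be passed to a decoder to produce speech in the reference speaker's voice and emotional style.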

