AS · AI · SD · Dec 29, 2023

Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion

arXiv:2312.17508v1 · 7 citations · h-index: 28 · INTERSPEECH
Originality: Incremental advance
AI Analysis

This work addresses the challenge of expressing fine-grained emotions in voice conversion for applications like human-computer interaction, though it appears incremental as it builds on existing disentanglement approaches.

The paper tackles fine-grained emotional attribute expression in Emotional Voice Conversion by proposing an Attention-based Interactive Disentangling Network (AINN) trained with a two-stage pipeline. The authors report that it outperforms state-of-the-art methods on both objective and subjective metrics.

Emotional Voice Conversion aims to manipulate a speech utterance according to a given emotion while preserving its non-emotional components. Existing approaches struggle to express fine-grained emotional attributes. In this paper, we propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion. We introduce a two-stage pipeline to train our network effectively. Stage I uses inter-speech contrastive learning to model fine-grained emotion and intra-speech disentanglement learning to better separate emotion from content. In Stage II, we regularize the conversion with a multi-view consistency mechanism, which helps transfer fine-grained emotion while maintaining speech content. Extensive experiments show that AINN outperforms state-of-the-art methods on both objective and subjective metrics.
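
The abstract does not spell out the contrastive objective used in Stage I. As a rough illustration of what inter-speech contrastive learning over emotion embeddings can look like, the sketch below uses an InfoNCE-style loss, which is a common choice and an assumption here, not the paper's confirmed formulation. The function names, the temperature value, and the toy 2-D embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor embedding.

    Assumption: `positive` comes from another utterance with the same
    emotion as `anchor`, and `negatives` come from utterances with
    different emotions. The loss is small when the anchor is closer to
    the positive than to any negative.
    """
    pos_logit = cosine(anchor, positive) / temperature
    logits = [pos_logit] + [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - pos_logit


# Toy emotion embeddings (hypothetical values, not taken from the paper).
happy_a = [1.0, 0.1]
happy_b = [0.9, 0.2]
sad = [-1.0, 0.3]
angry = [0.1, -1.0]

# Pairing two "happy" utterances should cost less than pairing happy with sad.
aligned = info_nce(happy_a, happy_b, [sad, angry])
misaligned = info_nce(happy_a, sad, [happy_b, angry])
print(aligned < misaligned)
```

Minimizing a loss of this kind pulls embeddings of same-emotion utterances together and pushes different-emotion utterances apart, which is one way to obtain the instance-level emotion representation the abstract describes.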
