ASSD Oct 8, 2021

Cross-speaker Emotion Transfer Based on Speaker Condition Layer Normalization and Semi-Supervised Training in Text-To-Speech

arXiv:2110.04153v2, 0.00, 19 citations
AI Analysis: 45

The paper addresses the time-consuming acquisition of emotional speech data for speech synthesis, but it is an incremental improvement focused on a specific domain.

The paper tackles the problem of acquiring emotional audio for arbitrary speakers in expressive speech synthesis by proposing a cross-speaker emotion transfer method that transfers emotions from a source to a target speaker, achieving improvements in timbre similarity, stability, and emotion perception over a baseline.

Expressive speech synthesis places high demands on emotional interpretation. However, acquiring an emotional audio corpus for arbitrary speakers is time-consuming, because it is limited by the speakers' interpretive ability. To address this problem, this paper proposes a cross-speaker emotion transfer method that transfers emotions from a source speaker to a target speaker. A set of emotion tokens is first defined to represent the emotion categories. To make synthesis controllable, the tokens are trained with a cross-entropy loss and a semi-supervised training strategy so that they are highly correlated with their corresponding emotions. Meanwhile, to eliminate the degradation in timbre similarity caused by cross-speaker emotion transfer, speaker condition layer normalization is implemented to model speaker characteristics. Experimental results show that the proposed method outperforms the multi-reference based baseline in timbre similarity, stability, and emotion perception evaluations.
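The abstract names speaker condition layer normalization but does not give its formulation. The sketch below shows one common reading of conditional layer normalization, in which the scale and bias applied after normalization are predicted from a speaker embedding instead of being fixed learned vectors. This is an assumption for illustration, not the authors' implementation: the class name, dimensions, random initialization, and the use of two linear projections are all hypothetical.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and (near) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weights, bias, vec):
    """Compute W @ vec + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


class SpeakerConditionLayerNorm:
    """Hypothetical sketch: layer norm whose scale and bias depend on the speaker."""

    def __init__(self, hidden_dim, speaker_dim, seed=0):
        rng = random.Random(seed)

        def init():
            # Small random weights (assumed initialization scheme).
            return [[rng.gauss(0.0, 0.02) for _ in range(speaker_dim)]
                    for _ in range(hidden_dim)]

        self.w_scale = init()
        self.b_scale = [1.0] * hidden_dim  # scale starts at 1 for a zero embedding
        self.w_bias = init()
        self.b_bias = [0.0] * hidden_dim   # bias starts at 0 for a zero embedding

    def __call__(self, hidden, speaker_embedding):
        # Predict a per-feature scale and bias from the speaker embedding.
        scale = linear(self.w_scale, self.b_scale, speaker_embedding)
        bias = linear(self.w_bias, self.b_bias, speaker_embedding)
        normed = layer_norm(hidden)
        return [s * h + b for s, h, b in zip(scale, normed, bias)]


scln = SpeakerConditionLayerNorm(hidden_dim=4, speaker_dim=3)
hidden = [0.5, -1.0, 2.0, 0.0]
out_a = scln(hidden, [1.0, 0.0, 0.0])  # hypothetical speaker A embedding
out_b = scln(hidden, [0.0, 1.0, 0.0])  # hypothetical speaker B embedding
```

In this sketch the same hidden vector yields different outputs for the two speaker embeddings, while a zero embedding reduces the layer to plain layer normalization. A real model would learn the projection weights jointly with the rest of the network.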

Code Implementations: 1 repo
