Speaker-invariant Affective Representation Learning via Adversarial Training
This work addresses speaker-invariant emotion recognition for applications in human-computer interaction, though it is incremental in that it builds on existing adversarial techniques.
The paper tackles speaker variability in speech emotion recognition by proposing an adversarial training framework that disentangles speaker characteristics from emotion. The authors report improved emotion classification and better generalization to unseen speakers on the IEMOCAP and CMU-MOSEI datasets.
Representation learning for speech emotion recognition is challenging due to the sparsity of labeled data and the lack of gold-standard references. In addition, there is considerable variability arising from the input speech signals, from subjective human perception of those signals, and from ambiguity in the emotion labels. In this paper, we propose a machine learning framework that obtains speech emotion representations by limiting the effect of speaker variability in the speech signals. Specifically, we propose to disentangle speaker characteristics from emotion through an adversarial training network in order to better represent emotion. Our method combines the gradient reversal technique with an entropy loss function to remove such speaker information. Our approach is evaluated on both the IEMOCAP and CMU-MOSEI datasets. We show that our method improves speech emotion classification and increases generalization to unseen speakers.
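
To make the two ingredients named in the abstract concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a toy setup in which the encoder features are fed directly to a speaker classifier as logits, and it assumes the entropy term is meant to push the speaker posterior toward a uniform distribution; the abstract does not specify either detail. All names (`GradientReversal`, `speaker_ce_grad`, `lam`) are hypothetical and chosen for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of a speaker posterior."""
    return -sum(p * math.log(p + eps) for p in probs)


class GradientReversal:
    """Identity in the forward pass; scales gradients by -lam in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam  # hypothetical trade-off weight

    def forward(self, features):
        return list(features)

    def backward(self, grad_output):
        return [-self.lam * g for g in grad_output]


def speaker_ce_grad(logits, speaker_id):
    """Gradient of the speaker cross-entropy w.r.t. the logits: softmax - one_hot."""
    probs = softmax(logits)
    return [p - (1.0 if i == speaker_id else 0.0) for i, p in enumerate(probs)]


# Toy assumption: the speaker classifier is the identity map, so the
# encoder features act directly as speaker logits.
features = [2.0, 0.5, -1.0]
grl = GradientReversal(lam=1.0)

logits = grl.forward(features)                       # forward: unchanged
grad_logits = speaker_ce_grad(logits, speaker_id=0)  # what the speaker classifier follows
grad_features = grl.backward(grad_logits)            # what the encoder receives (sign flipped)

# Entropy of the speaker posterior; under the assumption above, the encoder
# would be trained to increase this value.
speaker_entropy = entropy(softmax(logits))
```

In this sketch the speaker classifier descends along `grad_logits` to identify the speaker, while the encoder receives `grad_features`, the same gradient with its sign flipped, and is therefore pushed to make the speaker harder to identify. A speaker posterior that carries no speaker information is uniform, which is the distribution with maximal `entropy`.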