Speaker Normalization for Self-supervised Speech Emotion Recognition
This work addresses the challenge of generalization in speech emotion recognition, where small, biased datasets are common, though the contribution is incremental because it builds on existing adversarial methods.
The paper tackles the problem of speaker bias in speech emotion recognition by proposing a gradient-based adversary learning framework that normalizes speaker characteristics from feature representations, achieving new state-of-the-art results on the IEMOCAP dataset.
Large speech emotion recognition datasets are hard to obtain, and small datasets may contain biases. Deep-net-based classifiers, in turn, are prone to exploit those biases and find shortcuts such as speaker characteristics. These shortcuts usually harm a model's ability to generalize. To address this challenge, we propose a gradient-based adversary learning framework that learns a speech emotion recognition task while normalizing speaker characteristics from the feature representation. We demonstrate the efficacy of our method in both speaker-independent and speaker-dependent settings and obtain new state-of-the-art results on the challenging IEMOCAP dataset.
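The abstract does not spell out the update rule, so the following is only a minimal sketch of one common way to realize gradient-based adversarial training: gradient reversal, where the shared encoder descends the emotion loss while ascending the speaker-identification loss. Whether the paper uses exactly this scheme is an assumption. The function names `reversed_gradient` and `sgd_step`, the weight `lam`, and the learning rate `lr` are all hypothetical and chosen for illustration.

```python
def reversed_gradient(grad_emotion, grad_speaker, lam=1.0):
    """Combine per-parameter gradients for the shared encoder.

    Assumption: gradient-reversal-style training. The encoder follows the
    emotion-loss gradient as usual, but the speaker-loss gradient has its
    sign flipped (scaled by lam), so the encoder is pushed to make the
    speaker harder to predict from its features.
    """
    return [ge - lam * gs for ge, gs in zip(grad_emotion, grad_speaker)]


def sgd_step(params, grads, lr=0.1):
    """Plain gradient-descent update on a flat list of parameters."""
    return [p - lr * g for p, g in zip(params, grads)]


# Toy values standing in for gradients of the two losses with respect to
# two encoder parameters (illustrative numbers, not from the paper).
encoder_params = [1.0, 1.0]
grad_emotion = [1.0, 2.0]
grad_speaker = [0.5, -1.0]

combined = reversed_gradient(grad_emotion, grad_speaker, lam=2.0)
encoder_params = sgd_step(encoder_params, combined, lr=0.5)
```

In such a setup the speaker classifier itself would still be trained with the ordinary, unreversed speaker gradient; only the shared encoder sees the flipped sign. The weight `lam` trades off emotion accuracy against how strongly speaker information is suppressed.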