SD | AI | AS | Dec 25, 2023

DSNet: Disentangled Siamese Network with Neutral Calibration for Speech Emotion Recognition

arXiv:2312.15593v1
1 citation
h-index: 5
Journal of Shanghai Jiaotong University (Science)
Originality: Highly original
AI Analysis

This work addresses the generalization challenge in SER for practical applications, representing an incremental improvement with a novel method for a known bottleneck.

The paper tackles speech emotion recognition (SER), targeting the unconscious encoding of emotion-irrelevant factors such as speaker variability. It proposes DSNet, which combines disentangled features with neutral calibration to improve robustness and explainability, and reports better results than state-of-the-art methods on benchmark datasets.

One persistent challenge in deep learning based speech emotion recognition (SER) is the unconscious encoding of emotion-irrelevant factors (e.g., speaker or phonetic variability), which limits the generalization of SER in practical use. In this paper, we propose DSNet, a Disentangled Siamese Network with neutral calibration, to meet the demand for a more robust and explainable SER model. Specifically, we introduce an orthogonal feature disentanglement module to explicitly project the high-level representation into two distinct subspaces. Later, we propose a novel neutral calibration mechanism to encourage one subspace to capture sufficient emotion-irrelevant information. In this way, the other one can better isolate and emphasize the emotion-relevant information within speech signals. Experimental results on two popular benchmark datasets demonstrate the superiority of DSNet over various state-of-the-art methods for speaker-independent SER.
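
The abstract does not give the exact formulation, so the following is only a minimal NumPy sketch of the two ideas it names, under stated assumptions. It assumes linear projection heads, an orthogonality penalty defined as the squared Frobenius norm of the product of the two projection matrices, and a calibration term that pulls the emotion-irrelevant features of an utterance toward those of a neutral reference utterance. All names (`project`, `orthogonality_penalty`, `calibration_penalty`, `w_emo`, `w_irr`) are hypothetical and are not taken from the paper.

```python
import numpy as np


def project(h, w_emo, w_irr):
    """Project features h of shape (batch, d) into two subspaces.

    w_emo and w_irr are (d, k) matrices (assumed linear heads, not the paper's).
    Returns (emotion-relevant features, emotion-irrelevant features).
    """
    return h @ w_emo, h @ w_irr


def orthogonality_penalty(w_emo, w_irr):
    """Squared Frobenius norm of w_emo^T w_irr.

    It is zero exactly when every column of w_emo is orthogonal
    to every column of w_irr.
    """
    return float(np.sum((w_emo.T @ w_irr) ** 2))


def calibration_penalty(z_irr, z_irr_neutral):
    """Mean squared distance between the emotion-irrelevant features of an
    utterance and those of a neutral reference (assumed form of calibration)."""
    return float(np.mean(np.sum((z_irr - z_irr_neutral) ** 2, axis=1)))


rng = np.random.default_rng(0)
d, k = 8, 4

# Build two mutually orthogonal projections by splitting an orthonormal basis.
q, _ = np.linalg.qr(rng.normal(size=(d, d)))
w_emo, w_irr = q[:, :k], q[:, k:]

h = rng.normal(size=(3, d))          # stand-in for emotional utterance features
h_neutral = rng.normal(size=(3, d))  # stand-in for neutral reference features

z_emo, z_irr = project(h, w_emo, w_irr)
_, z_irr_neutral = project(h_neutral, w_emo, w_irr)

ortho = orthogonality_penalty(w_emo, w_irr)
calib = calibration_penalty(z_irr, z_irr_neutral)
```

In a training setup both penalties would presumably be added to the classification loss with weighting coefficients; those weights, and the actual encoder, are outside what the abstract specifies.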

