SD · AI · HC · LG · AS · Oct 10, 2025

Emotion-Disentangled Embedding Alignment for Noise-Robust and Cross-Corpus Speech Emotion Recognition

arXiv:2510.09072v1 · h-index: 12
Originality: Incremental advance
AI Analysis

This work addresses speech emotion recognition under real-world conditions, where background noise and variability across datasets degrade performance. It is assessed as an incremental improvement over existing methods.

The paper tackles speech emotion recognition in noisy environments and across different datasets with a two-step representation-learning approach. The reported outcome is better robustness and generalization, shown by improved performance on unseen noisy and cross-corpus samples.

Effectiveness of speech emotion recognition in real-world scenarios is often hindered by noisy environments and variability across datasets. This paper introduces a two-step approach to enhance the robustness and generalization of speech emotion recognition models through improved representation learning. First, our model employs EDRL (Emotion-Disentangled Representation Learning) to extract class-specific discriminative features while preserving shared similarities across emotion categories. Next, MEA (Multiblock Embedding Alignment) refines these representations by projecting them into a joint discriminative latent subspace that maximizes covariance with the original speech input. The learned EDRL-MEA embeddings are subsequently used to train an emotion classifier using clean samples from publicly available datasets, and are evaluated on unseen noisy and cross-corpus speech samples. Improved performance under these challenging conditions demonstrates the effectiveness of the proposed method.
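
The alignment step can be pictured with a small sketch. The abstract does not specify how MEA is computed, so the code below is not the paper's method: it assumes a plain two-block, PLS-style projection obtained from the SVD of the cross-covariance matrix, which is one standard way to find a subspace that maximizes covariance between two views. The function name `covariance_maximizing_projection`, the random stand-in data, and the dimensions are all hypothetical.

```python
import numpy as np

def covariance_maximizing_projection(X, Z, n_components=2):
    """Illustrative two-block alignment (an assumption, not the paper's MEA).

    X: (n_samples, d_x) stand-in for the original speech features.
    Z: (n_samples, d_z) stand-in for the learned embeddings.
    Returns W_z (d_z, k), W_x (d_x, k) and the top-k singular values,
    which equal the covariances between the paired projections.
    """
    Xc = X - X.mean(axis=0)            # center both views
    Zc = Z - Z.mean(axis=0)
    # Sample cross-covariance between embedding and input dimensions.
    C = Zc.T @ Xc / (X.shape[0] - 1)   # shape (d_z, d_x)
    # Leading singular vectors give the direction pairs of maximal covariance.
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    W_z = U[:, :n_components]
    W_x = Vt[:n_components].T
    return W_z, W_x, s[:n_components]

# Hypothetical data: embeddings Z that depend on part of the input X.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
Z = X[:, :4] @ rng.normal(size=(4, 6)) + 0.1 * rng.normal(size=(200, 6))

W_z, W_x, s = covariance_maximizing_projection(X, Z, n_components=2)
latent = (Z - Z.mean(axis=0)) @ W_z    # embeddings projected into the shared subspace
```

In this sketch `latent` plays the role of the aligned embeddings that would be passed to a downstream emotion classifier. A multiblock variant would handle more than two views at once; how the paper does so is not stated in the abstract.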

