SDLG Oct 11, 2025

Improving Speech Emotion Recognition with Mutual Information Regularized Generative Model

arXiv:2510.10078v2
h-index: 11
Originality: Incremental advance
AI Analysis

This work addresses data scarcity in speech emotion recognition; it is an incremental improvement relevant to applications in human-computer interaction and affective computing.

The authors tackle the problem of limited labeled training data in speech emotion recognition by proposing a data augmentation framework that uses mutual information regularization and cross-modal information transfer. The framework improved emotion prediction performance on the benchmark datasets IEMOCAP, MSP-IMPROV, and MSP-Podcast.

Although speech emotion recognition (SER) research has advanced thanks to deep learning methods, it still suffers from the difficulty of obtaining large, quality-labelled training data. Data augmentation methods have been attempted to mitigate this issue, and generative models have recently shown success among them. We propose a data augmentation framework that is aided by cross-modal information transfer and mutual information regularization. A mutual-information-based metric can serve as an indicator of the quality of the generated data. Furthermore, we expand the scope of this data augmentation to multimodal inputs, thanks to mutual information ensuring dependency between modalities. Our framework was tested on three benchmark datasets: IEMOCAP, MSP-IMPROV and MSP-Podcast. The implementation was designed to generate input features that are fed into the last layer for emotion classification. Our framework improved emotion prediction performance over existing works. We also found that our framework is able to generate new inputs without any cross-modal information.
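
To make the idea of a mutual-information-based dependency measure concrete, here is a minimal sketch of a plug-in estimate of I(X; Y) from paired discrete samples. It is an illustration only: the abstract does not say which estimator the authors use, and the function name `mutual_information` and the assumption of discretized inputs (for example, quantized features from two modalities) are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in nats from paired discrete samples.

    Illustrative only; not the estimator used in the paper.
    """
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # empirical joint counts
    px = Counter(xs)              # empirical marginal counts of X
    py = Counter(ys)              # empirical marginal counts of Y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


# Perfectly dependent pairs give the entropy of X (log 2 for two equally likely values).
dependent = mutual_information([0, 1, 0, 1], [0, 1, 0, 1])
# Empirically independent pairs give zero.
independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Under this sketch, a higher value indicates stronger statistical dependency between the two sequences, which is the sense in which a mutual information score could flag whether generated inputs remain tied to the other modality.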

