CL · Jun 11, 2025

MEDUSA: A Multimodal Deep Fusion Multi-Stage Training Framework for Speech Emotion Recognition in Naturalistic Conditions

Georgios Chatzichristodoulou, Despoina Kosmopoulou, Antonios Kritikos, Anastasia Poulopoulou, Efthymios Georgiou, Athanasios Katsamanis, Vassilis Katsouros, Alexandros Potamianos

arXiv:2506.09556v24.94 citations · h-index: 43 · Has Code · INTERSPEECH

Originality: Incremental advance

AI Analysis

This work addresses the problem of accurately recognizing emotions from speech in real-world settings for applications like human-computer interaction, though it appears incremental as it builds on existing multimodal and ensemble methods.

The authors tackled speech emotion recognition in naturalistic conditions by proposing MEDUSA, a multimodal framework with a four-stage training pipeline that handles class imbalance and emotion ambiguity, achieving first place in the Interspeech 2025 challenge for categorical emotion recognition.

Speech emotion recognition (SER) is a challenging task due to the subjective nature of human emotions and their uneven representation under naturalistic conditions. We propose MEDUSA, a multimodal framework with a four-stage training pipeline, which effectively handles class imbalance and emotion ambiguity. The first two stages train an ensemble of classifiers that utilize DeepSER, a novel extension of a deep cross-modal transformer fusion mechanism from pretrained self-supervised acoustic and linguistic representations. Manifold MixUp is employed for further regularization. The last two stages optimize a trainable meta-classifier that combines the ensemble predictions. Our training approach incorporates human annotation scores as soft targets, coupled with balanced data sampling and multitask learning. MEDUSA ranked 1st in Task 1: Categorical Emotion Recognition in the Interspeech 2025: Speech Emotion Recognition in Naturalistic Conditions Challenge.
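Two of the ingredients named in the abstract, annotation scores as soft targets and Manifold MixUp, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the vote-counting scheme, the Beta parameter `alpha=0.2`, and the choice of which hidden layer to mix are all assumptions made for illustration.

```python
import math
import random


def soft_targets(votes, num_classes):
    """Convert annotator votes (class indices) into a probability distribution.

    Assumption: each annotator casts one equally weighted vote.
    """
    counts = [0.0] * num_classes
    for v in votes:
        counts[v] += 1.0
    total = sum(counts)
    return [c / total for c in counts]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def soft_cross_entropy(logits, target):
    """Cross-entropy between a soft target distribution and model logits."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


def sample_lambda(alpha=0.2, rng=random):
    """Draw the mixing coefficient from Beta(alpha, alpha), as in MixUp."""
    return rng.betavariate(alpha, alpha)


def manifold_mixup(h_a, y_a, h_b, y_b, lam):
    """Interpolate two hidden representations and their soft targets.

    In Manifold MixUp the inputs h_a and h_b are activations taken at an
    intermediate layer rather than raw inputs; which layer is used is
    left open here.
    """
    h = [lam * a + (1 - lam) * b for a, b in zip(h_a, h_b)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_a, y_b)]
    return h, y
```

For example, four annotators voting `[0, 0, 1, 2]` over four classes give the target `[0.5, 0.25, 0.25, 0.0]`, which keeps the disagreement between annotators visible to the loss instead of collapsing it to a single label.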
