LG · AI · Mar 31, 2025

Unimodal-driven Distillation in Multimodal Emotion Recognition with Dynamic Fusion

arXiv:2503.23721v13 citations · h-index: 4 · ICME
Originality: Highly original
AI Analysis

This work improves emotion recognition for intelligent dialogue systems and opinion analysis. Its contribution is incremental: it builds on existing multimodal fusion methods and adds novel distillation techniques.

The paper tackles multimodal emotion recognition in conversations, where modal heterogeneity can disorient multimodal learning. It proposes a framework in which a pre-trained unimodal teacher guides multimodal fusion, and it reports state-of-the-art performance on the IEMOCAP and MELD datasets, especially for minority and semantically similar emotions.

Multimodal Emotion Recognition in Conversations (MERC) identifies emotional states across text, audio and video, which is essential for intelligent dialogue systems and opinion analysis. Existing methods emphasize heterogeneous modal fusion directly for cross-modal integration, but often suffer from disorientation in multimodal learning due to modal heterogeneity and lack of instructive guidance. In this work, we propose SUMMER, a novel heterogeneous multimodal integration framework leveraging Mixture of Experts with Hierarchical Cross-modal Fusion and Interactive Knowledge Distillation. Key components include a Sparse Dynamic Mixture of Experts (SDMoE) for capturing dynamic token-wise interactions, a Hierarchical Cross-Modal Fusion (HCMF) for effective fusion of heterogeneous modalities, and Interactive Knowledge Distillation (IKD), which uses a pre-trained unimodal teacher to guide multimodal fusion in latent and logit spaces. Experiments on IEMOCAP and MELD show SUMMER outperforms state-of-the-art methods, particularly in recognizing minority and semantically similar emotions.
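To make the idea of distilling in both latent and logit spaces concrete, here is a minimal sketch in plain Python. The abstract does not state the loss forms used by IKD, so everything here is an assumption: the logit term is a standard temperature-softened KL divergence, the latent term is a mean squared error, and the function names, the `alpha` weight, and the `temperature` value are hypothetical illustrations rather than the authors' implementation.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Temperature-scaled log-softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_norm for z in scaled]


def logit_distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumed form (generic knowledge distillation), not taken from the paper.
    """
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return kl * temperature ** 2


def latent_distillation_loss(teacher_features, student_features):
    """Mean squared error between teacher and student feature vectors."""
    n = len(teacher_features)
    return sum((t - s) ** 2
               for t, s in zip(teacher_features, student_features)) / n


def distillation_loss(teacher_logits, student_logits,
                      teacher_features, student_features,
                      alpha=0.5, temperature=2.0):
    """Weighted sum of the logit-space and latent-space terms.

    alpha is a hypothetical mixing weight chosen for illustration.
    """
    logit_term = logit_distillation_loss(
        teacher_logits, student_logits, temperature)
    latent_term = latent_distillation_loss(teacher_features, student_features)
    return alpha * logit_term + (1.0 - alpha) * latent_term
```

In this sketch the unimodal teacher would supply `teacher_logits` and `teacher_features`, and the fused multimodal model would supply the student values. The loss is zero when the student matches the teacher in both spaces and grows as either the predicted distribution or the latent representation drifts away.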

