Unimodal-driven Distillation in Multimodal Emotion Recognition with Dynamic Fusion
This work improves emotion recognition for intelligent dialogue systems and opinion analysis. The contribution is incremental: it builds on existing multimodal fusion methods and adds new distillation techniques to them.
The paper addresses multimodal emotion recognition in conversations, where modal heterogeneity and a lack of guidance leave multimodal learning without clear direction. It proposes a framework in which a pre-trained unimodal teacher guides multimodal fusion, and it reports state-of-the-art performance on the IEMOCAP and MELD datasets, especially for minority and semantically similar emotions.
Multimodal Emotion Recognition in Conversations (MERC) identifies emotional states across text, audio and video, which is essential for intelligent dialogue systems and opinion analysis. Existing methods emphasize heterogeneous modal fusion directly for cross-modal integration, but often suffer from disorientation in multimodal learning due to modal heterogeneity and lack of instructive guidance. In this work, we propose SUMMER, a novel heterogeneous multimodal integration framework leveraging Mixture of Experts with Hierarchical Cross-modal Fusion and Interactive Knowledge Distillation. Key components include a Sparse Dynamic Mixture of Experts (SDMoE) for capturing dynamic token-wise interactions, a Hierarchical Cross-Modal Fusion (HCMF) for effective fusion of heterogeneous modalities, and Interactive Knowledge Distillation (IKD), which uses a pre-trained unimodal teacher to guide multimodal fusion in latent and logit spaces. Experiments on IEMOCAP and MELD show SUMMER outperforms state-of-the-art methods, particularly in recognizing minority and semantically similar emotions.
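To illustrate what guiding a multimodal student "in latent and logit spaces" can look like, here is a minimal sketch of a two-part distillation loss. The abstract does not specify the loss functions, so the choices below are assumptions: a temperature-scaled KL divergence for the logit space and a mean squared error for the latent space, both common in knowledge distillation. All function names and the weights alpha and beta are hypothetical and are not taken from SUMMER.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def logit_distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Assumed logit-space term: KL between softened teacher and student outputs."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # The T^2 factor is the usual rescaling that keeps gradient magnitudes comparable.
    return temperature ** 2 * kl_divergence(p_teacher, p_student)


def latent_distillation_loss(teacher_feat, student_feat):
    """Assumed latent-space term: mean squared error between feature vectors."""
    squared = [(t - s) ** 2 for t, s in zip(teacher_feat, student_feat)]
    return sum(squared) / len(squared)


def distillation_loss(teacher_logits, student_logits,
                      teacher_feat, student_feat,
                      alpha=0.5, beta=0.5, temperature=2.0):
    """Weighted sum of the two terms; alpha and beta are hypothetical weights."""
    logit_term = logit_distillation_loss(teacher_logits, student_logits, temperature)
    latent_term = latent_distillation_loss(teacher_feat, student_feat)
    return alpha * logit_term + beta * latent_term


if __name__ == "__main__":
    # Toy values: a unimodal teacher and a multimodal student over 3 emotion classes.
    teacher_logits = [2.0, 0.5, -1.0]
    student_logits = [1.0, 1.0, 0.0]
    teacher_feat = [0.2, -0.4, 0.9]
    student_feat = [0.1, -0.1, 0.5]
    print(distillation_loss(teacher_logits, student_logits,
                            teacher_feat, student_feat))
```

In practice such a term would be added to the ordinary classification loss of the student, with the teacher's parameters kept frozen; how SUMMER actually weights and combines these terms is not stated in the abstract.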