More Is Better: A MoE-Based Emotion Recognition Framework with Human Preference Alignment
This work addresses emotion recognition for applications such as human-computer interaction. Its contribution is incremental, since it builds on existing Mixture of Experts (MoE) and pseudo-labeling techniques.
The paper tackles semi-supervised emotion recognition by proposing a Mixture of Experts (MoE) framework that treats diverse input modalities as independent experts and uses consensus-based pseudo-labeling to exploit unlabeled data. It achieves an F1-score of 0.8772 on the test set, ranking 2nd in the MER2025-SEMI challenge.
In this paper, we present our solution for the semi-supervised learning track (MER-SEMI) in MER2025. We propose a comprehensive framework, grounded in the principle that "more is better," to construct a robust Mixture of Experts (MoE) emotion recognition system. Our approach integrates a diverse range of input modalities as independent experts, including novel signals such as knowledge from large Vision-Language Models (VLMs) and temporal Action Unit (AU) information. To effectively utilize unlabeled data, we introduce a consensus-based pseudo-labeling strategy that generates high-quality labels from the agreement between a baseline model and Gemini; these labels are then used in a two-stage training paradigm. Finally, we employ a multi-expert voting ensemble combined with a rule-based re-ranking process to correct prediction bias and better align the outputs with human preferences. Evaluated on the MER2025-SEMI challenge dataset, our method achieves an F1-score of 0.8772 on the test set, ranking 2nd in the track. Our code is available at https://github.com/zhuyjan/MER2025-MRAC25.
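The two ideas named in the abstract, consensus-based pseudo-labeling and multi-expert voting, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the dictionary-of-labels input format, the sample IDs, and the tie-breaking rule are all assumptions made for illustration, and the rule-based re-ranking step is omitted because the abstract does not describe its rules.

```python
from collections import Counter
from typing import Dict, List


def consensus_pseudo_labels(
    baseline_preds: Dict[str, str],
    vlm_preds: Dict[str, str],
) -> Dict[str, str]:
    """Keep only the unlabeled samples on which both label sources agree.

    Hypothetical inputs: each dict maps a sample ID to a predicted emotion,
    one from a baseline model and one from a VLM such as Gemini.
    """
    return {
        sample_id: label
        for sample_id, label in baseline_preds.items()
        if vlm_preds.get(sample_id) == label
    }


def majority_vote(expert_preds: List[str]) -> str:
    """Return the most frequent label among the experts.

    Assumption: ties go to the label that appears first in the list,
    which is how Counter.most_common orders equal counts.
    """
    return Counter(expert_preds).most_common(1)[0][0]


# Illustrative data only; these IDs and labels are made up.
baseline = {"clip_001": "happy", "clip_002": "sad", "clip_003": "angry"}
vlm = {"clip_001": "happy", "clip_002": "neutral", "clip_003": "angry"}

pseudo = consensus_pseudo_labels(baseline, vlm)
final_label = majority_vote(["happy", "happy", "neutral"])
```

Here `clip_002` is dropped from the pseudo-labeled set because the two sources disagree. That filtering is the point of the consensus step: it trades coverage of the unlabeled data for label quality.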