MM · AI · SD · AS · Sep 12, 2024

Early Joint Learning of Emotion Information Makes MultiModal Model Understand You Better

arXiv:2409.18971v1 · 8 citations · h-index: 9
Originality: Synthesis-oriented
AI Analysis

This paper addresses emotion recognition for multimodal AI systems, but its contribution is incremental: it applies hybrid methods on top of existing challenge frameworks.

The paper tackles multimodal emotion recognition by using early fusion of audio and text features with a large language model to reduce modal competition, combined with data mining techniques and audio preprocessing. Their approach achieved 2nd place in two MER2024 sub-challenges.

In this paper, we present our solutions for emotion recognition in the sub-challenges of the Multimodal Emotion Recognition Challenge (MER2024). To mitigate the modal competition issue between audio and text, we adopt an early fusion strategy based on a large language model, in which audio and text are first trained jointly. The joint audio-text feature is then late-fused with the other unimodal features. To address data insufficiency and class imbalance, we use multiple rounds of multi-model voting for data mining. Moreover, to enhance the quality of the audio features, we employ speech source separation to preprocess the audio. Our model ranks 2nd in both MER2024-SEMI and MER2024-NOISE, validating our method's effectiveness.
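The abstract does not spell out how the multi-model voting is carried out, so the following is only a minimal sketch of one plausible reading: several classifiers label each unlabeled sample, and a sample is kept as a pseudo-label only when enough of them agree. The names `vote`, `mine_pseudo_labels`, `min_agreement`, and the optional `retrain` hook are all hypothetical and do not come from the paper.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Sequence

Label = str
Model = Callable[[str], Label]  # assumption: a model maps a sample id/text to a label


def vote(predictions: Sequence[Label], min_agreement: float) -> Optional[Label]:
    """Return the most common label if its share reaches min_agreement, else None."""
    label, count = Counter(predictions).most_common(1)[0]
    return label if count / len(predictions) >= min_agreement else None


def mine_pseudo_labels(
    models: List[Model],
    unlabeled: Sequence[str],
    rounds: int = 2,
    min_agreement: float = 1.0,
    retrain: Optional[Callable[[List[Model], Dict[str, Label]], List[Model]]] = None,
) -> Dict[str, Label]:
    """Repeatedly vote on the remaining unlabeled samples and collect agreed labels.

    `retrain` is a hypothetical hook: if given, it receives the current models and
    the labels mined so far, and returns updated models for the next round.
    """
    mined: Dict[str, Label] = {}
    remaining = list(unlabeled)
    for _ in range(rounds):
        still_unlabeled = []
        for sample in remaining:
            label = vote([model(sample) for model in models], min_agreement)
            if label is None:
                still_unlabeled.append(sample)  # no consensus yet, retry next round
            else:
                mined[sample] = label
        remaining = still_unlabeled
        if retrain is not None:
            models = retrain(models, mined)
    return mined
```

With `min_agreement` above 0.5, a tie between labels can never pass the threshold, so the order in which `Counter` breaks ties does not affect the result. A stricter threshold yields fewer but cleaner pseudo-labels; how the authors actually balance this, and how they use the mined data against class imbalance, is not stated in the abstract.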

