CV | Mar 21, 2025

Feature-Based Dual Visual Feature Extraction Model for Compound Multimodal Emotion Recognition

arXiv:2503.17453v1 | 2 citations | h-index: 11 | Has Code
Originality Synthesis-oriented
AI Analysis

This work addresses uncertainty and modal conflicts in compound emotion recognition for affective computing and human-computer interaction, but it appears incremental as it combines existing models.

The paper tackles compound multimodal emotion recognition by proposing a method that fuses Vision Transformer and Residual Network features, reporting superior performance in scenarios with complex visual and audio cues, such as the C-EXPR-DB dataset.

This article presents our results for the eighth Affective Behavior Analysis in-the-wild (ABAW) competition. Multimodal emotion recognition (ER) has important applications in affective computing and human-computer interaction. However, in the real world, compound emotion recognition faces greater uncertainty and more modal conflicts. For the Compound Expression (CE) Recognition Challenge, this paper proposes a multimodal emotion recognition method that fuses the features of a Vision Transformer (ViT) and a Residual Network (ResNet). We conducted experiments on the C-EXPR-DB and MELD datasets. The results show that in scenarios with complex visual and audio cues (such as C-EXPR-DB), the model that fuses ViT and ResNet features exhibits superior performance. Our code is available at https://github.com/MyGitHub-ax/8th_ABAW
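The abstract does not say how the ViT and ResNet features are combined, so the following is only a minimal sketch of one common option: late fusion by concatenation, followed by a linear classifier. The feature sizes (768 and 2048), the class count (7), the function names, and the random stand-in features are all illustrative assumptions and are not taken from the paper or its repository.

```python
import numpy as np

VIT_DIM = 768      # assumed: typical ViT-Base embedding size
RESNET_DIM = 2048  # assumed: typical ResNet-50 pooled feature size
NUM_CLASSES = 7    # assumed number of compound-expression classes


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so neither backbone dominates the fusion."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def fuse_features(vit_feat, resnet_feat):
    """Concatenate normalized features (an assumed fusion strategy)."""
    return np.concatenate(
        [l2_normalize(vit_feat), l2_normalize(resnet_feat)], axis=-1
    )


def classify(fused, weights, bias):
    """Linear head followed by a numerically stable softmax."""
    logits = fused @ weights + bias
    logits = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
batch = 4
# Random arrays stand in for real backbone outputs on a batch of face crops.
vit_feat = rng.normal(size=(batch, VIT_DIM))
resnet_feat = rng.normal(size=(batch, RESNET_DIM))

fused = fuse_features(vit_feat, resnet_feat)
# Untrained, randomly initialized head; a real model would learn these weights.
weights = rng.normal(scale=0.01, size=(VIT_DIM + RESNET_DIM, NUM_CLASSES))
bias = np.zeros(NUM_CLASSES)
probs = classify(fused, weights, bias)
```

Normalizing each feature vector before concatenation is a design choice in this sketch: it keeps the two backbones on a comparable scale, so the classifier does not simply favor whichever one produces larger activations.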

Code Implementations: 1 repo
