MM AI SD AS | Jul 29, 2025

Sync-TVA: A Graph-Attention Framework for Multimodal Emotion Recognition with Cross-Modal Fusion

arXiv:2507.21395v1 | 1 citation | h-index: 1
Originality: Incremental advance
AI Analysis

This work addresses multimodal emotion recognition for emotionally intelligent systems, presenting an incremental improvement over existing methods.

The paper tackles limited cross-modal interaction and imbalanced modality contributions in multimodal emotion recognition by proposing Sync-TVA, a graph-attention framework with dynamic enhancement and structured fusion. It reports consistent improvements in accuracy and weighted F1 score on the MELD and IEMOCAP datasets.

Multimodal emotion recognition (MER) is crucial for enabling emotionally intelligent systems that perceive and respond to human emotions. However, existing methods suffer from limited cross-modal interaction and imbalanced contributions across modalities. To address these issues, we propose Sync-TVA, an end-to-end graph-attention framework featuring modality-specific dynamic enhancement and structured cross-modal fusion. Our design incorporates a dynamic enhancement module for each modality and constructs heterogeneous cross-modal graphs to model semantic relations across text, audio, and visual features. A cross-attention fusion mechanism further aligns multimodal cues for robust emotion inference. Experiments on MELD and IEMOCAP demonstrate consistent improvements over state-of-the-art models in both accuracy and weighted F1 score, especially under class-imbalanced conditions.
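The abstract does not spell out the cross-attention fusion step, so the following is only a minimal sketch of generic scaled dot-product cross-attention, in which features from one modality (here text) act as queries over another (here audio). The function names, the toy feature values, and the omission of learned projection matrices are all assumptions made for illustration; this is not the Sync-TVA implementation.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention with queries from one modality and
    keys/values from another. Learned projections are omitted
    (an assumption made to keep the sketch short)."""
    scale = math.sqrt(len(queries[0]))
    dim_v = len(values[0])
    fused = []
    for q in queries:
        # Similarity of this query to every key in the other modality.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted sum of the other modality's value vectors.
        fused.append([sum(w * v[j] for w, v in zip(weights, values))
                      for j in range(dim_v)])
    return fused

# Hypothetical toy features: 2 text tokens and 3 audio frames, 4-dimensional.
text = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
audio = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]

# Each text token gathers information from the audio frames it matches best.
text_attending_audio = cross_attention(text, audio, audio)
```

Each output row is a convex combination of the audio frames, weighted toward the frames most similar to the corresponding text token. A full model would typically apply this in both directions and across all three modalities, with learned projections for queries, keys, and values.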

