CVAILGMMAug 14, 2024

End-to-end Semantic-centric Video-based Multimodal Affective Computing

arXiv:2408.07694v14 citationsh-index: 21
Originality Incremental advance
AI Analysis

This work addresses multimodal affective computing for human-AI interaction, but it appears incremental as it builds on existing fusion methods with specific improvements.

The paper tackled the problem of semantic imbalance and mismatch in multimodal affective computing for human-spoken videos by proposing an end-to-end framework, SemanticMAC, which outperformed state-of-the-art methods on 7 public datasets across four downstream tasks.

In the pathway toward Artificial General Intelligence (AGI), understanding human's affection is essential to enhance machine's cognition abilities. For achieving more sensual human-AI interaction, Multimodal Affective Computing (MAC) in human-spoken videos has attracted increasing attention. However, previous methods are mainly devoted to designing multimodal fusion algorithms, suffering from two issues: semantic imbalance caused by diverse pre-processing operations and semantic mismatch raised by inconsistent affection content contained in different modalities comparing with the multimodal ground truth. Besides, the usage of manual features extractors make they fail in building end-to-end pipeline for multiple MAC downstream tasks. To address above challenges, we propose a novel end-to-end framework named SemanticMAC to compute multimodal semantic-centric affection for human-spoken videos. We firstly employ pre-trained Transformer model in multimodal data pre-processing and design Affective Perceiver module to capture unimodal affective information. Moreover, we present a semantic-centric approach to unify multimodal representation learning in three ways, including gated feature interaction, multi-task pseudo label generation, and intra-/inter-sample contrastive learning. Finally, SemanticMAC effectively learn specific- and shared-semantic representations in the guidance of semantic-centric labels. Extensive experimental results demonstrate that our approach surpass the state-of-the-art methods on 7 public datasets in four MAC downstream tasks.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes