AI, CL, CV, MM · Oct 9, 2023

Learning Language-guided Adaptive Hyper-modality Representation for Multimodal Sentiment Analysis

arXiv:2310.05804v2 · 159 citations · h-index: 7
AI Analysis

This work addresses performance limitations in multimodal sentiment analysis, with applications such as social media analysis. It is incremental in that it builds on existing transformer-based methods.

The paper tackles the problem of sentiment-irrelevant and conflicting information hindering multimodal sentiment analysis by introducing an adaptive hyper-modality representation guided by language, achieving state-of-the-art performance on datasets like MOSI, MOSEI, and CH-SIMS.

Though Multimodal Sentiment Analysis (MSA) proves effective by utilizing rich information from multiple sources (e.g., language, video, and audio), the potential sentiment-irrelevant and conflicting information across modalities may hinder the performance from being further improved. To alleviate this, we present Adaptive Language-guided Multimodal Transformer (ALMT), which incorporates an Adaptive Hyper-modality Learning (AHL) module to learn an irrelevance/conflict-suppressing representation from visual and audio features under the guidance of language features at different scales. With the obtained hyper-modality representation, the model can obtain a complementary and joint representation through multimodal fusion for effective MSA. In practice, ALMT achieves state-of-the-art performance on several popular datasets (e.g., MOSI, MOSEI, and CH-SIMS), and extensive ablation studies demonstrate the validity and necessity of our irrelevance/conflict suppression mechanism.
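The idea of learning a representation from audio and visual features "under the guidance of language features" can be illustrated with plain cross-attention, where language tokens act as queries over the other two modalities. The following is a minimal sketch under stated assumptions, not the authors' AHL implementation: the function names (`cross_attention`, `language_guided_update`) are hypothetical, learned projection matrices are omitted (identity projections are assumed), and a single scale and single head are used.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention; each argument is a list of token vectors."""
    d = len(queries[0])
    out = []
    for q in queries:
        # Similarity of this query token to every key token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of value vectors, one output vector per query token.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def language_guided_update(hyper, language, audio, visual):
    """One illustrative update step (assumed form, not taken from the paper's code).

    Language tokens query the audio and visual tokens; the attended
    features are added to the running hyper-modality representation.
    """
    from_audio = cross_attention(language, audio, audio)
    from_visual = cross_attention(language, visual, visual)
    return [[h + a + v for h, a, v in zip(h_row, a_row, v_row)]
            for h_row, a_row, v_row in zip(hyper, from_audio, from_visual)]


# Toy inputs: 2 language tokens, 3 audio tokens, 3 visual tokens, dimension 2.
language = [[1.0, 0.0], [0.0, 1.0]]
audio = [[0.2, 0.1], [0.9, 0.3], [0.4, 0.8]]
visual = [[0.5, 0.5], [0.1, 0.7], [0.6, 0.2]]
hyper = [[0.0, 0.0] for _ in language]  # one hyper-modality vector per language token

hyper = language_guided_update(hyper, language, audio, visual)
print(len(hyper), len(hyper[0]))
```

Because the attention weights are computed from language queries, audio and visual tokens that match the language content poorly receive low weight. This is one simple way to picture the suppression of irrelevant or conflicting information that the abstract describes.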

Code Implementations: 1 repo
