CL · Jan 12, 2024

WisdoM: Improving Multimodal Sentiment Analysis by Fusing Contextual World Knowledge

arXiv:2401.06659v2 · 48 citations · h-index: 36 · MM
Originality: Incremental advance
AI Analysis

This work addresses the reliance on superficial information in multimodal sentiment analysis, which matters for applications that require deeper contextual understanding. The advance is incremental, since it builds on existing methods through a plug-in framework.

The paper tackles multimodal sentiment analysis by incorporating contextual world knowledge generated by large vision-language models, yielding an average improvement of +1.96% F1 score across five advanced methods.

Sentiment analysis is rapidly advancing by utilizing various data modalities (e.g., text, image). However, most previous works rely on superficial information and neglect contextual world knowledge (e.g., background information derived from but beyond the given image and text pairs), which restricts their ability to achieve better multimodal sentiment analysis (MSA). In this paper, we propose a plug-in framework named WisdoM to leverage the contextual world knowledge induced from large vision-language models (LVLMs) for enhanced MSA. WisdoM utilizes LVLMs to comprehensively analyze both images and corresponding texts, simultaneously generating pertinent context. To reduce the noise in the context, we also introduce a training-free contextual fusion mechanism. Experiments across diverse granularities of MSA tasks consistently demonstrate that our approach brings substantial improvements (an average +1.96% F1 score across five advanced methods) over several state-of-the-art methods.
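
The abstract does not spell out how the training-free contextual fusion works. As a minimal sketch of one plausible scheme, assumed here purely for illustration and not taken from the paper, a classifier could be run twice (with and without the LVLM-generated context) and the context-aware prediction blended in only when the plain prediction is uncertain. The function names, the margin-based uncertainty test, and the `margin_threshold` and `context_weight` parameters are all hypothetical.

```python
from typing import List, Sequence


def confidence_margin(probs: Sequence[float]) -> float:
    """Gap between the two largest class probabilities (small gap = uncertain)."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]


def fuse_predictions(
    p_plain: Sequence[float],
    p_context: Sequence[float],
    margin_threshold: float = 0.2,  # hypothetical cutoff for "uncertain"
    context_weight: float = 0.5,    # hypothetical interpolation weight
) -> List[float]:
    """Blend context-aware probabilities into uncertain predictions only.

    Assumption: both inputs are probability distributions over the same
    sentiment classes, produced by the same classifier without and with
    the generated context. No parameters are trained here.
    """
    if len(p_plain) != len(p_context):
        raise ValueError("probability vectors must have the same length")

    # Confident predictions are left untouched, so noisy context cannot hurt them.
    if confidence_margin(p_plain) >= margin_threshold:
        return list(p_plain)

    # Uncertain predictions are linearly interpolated with the context-aware ones.
    fused = [
        (1.0 - context_weight) * a + context_weight * b
        for a, b in zip(p_plain, p_context)
    ]
    total = sum(fused)
    return [x / total for x in fused]


# Classes: negative, neutral, positive (illustrative values only).
uncertain = fuse_predictions([0.45, 0.40, 0.15], [0.20, 0.70, 0.10])
confident = fuse_predictions([0.80, 0.10, 0.10], [0.20, 0.70, 0.10])
```

In this sketch the uncertain example shifts toward the class favoured by the context-aware run, while the confident example is returned unchanged. Gating on uncertainty is one simple way to limit the effect of noisy context without any additional training.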
