CL MM Mar 17, 2024

Mixture-of-Prompt-Experts for Multi-modal Semantic Understanding

Zichen Wu, Hsiu-Yuan Huang, Fanyi Qu, Yunfang Wu

arXiv:2403.11311v2 24.684 citations h-index: 5 LREC

Originality: Incremental advance

AI Analysis

This work addresses the difficulty of collecting and annotating high-quality multi-modal data for tasks such as sarcasm detection and sentiment analysis, offering a parameter-efficient few-shot learning approach.

The paper tackles few-shot multi-modal sarcasm detection and sentiment analysis by proposing Mixture-of-Prompt-Experts with Block-Aware Prompt Fusion (MoPE-BAF). With roughly 2% of the parameters (150M), it surpasses the much larger 8.2B InstructBLIP and also outperforms other prompt-based methods.

Deep multi-modal semantic understanding that goes beyond mining superficial content relations has received increasing attention in artificial intelligence. The challenges of collecting and annotating high-quality multi-modal data have underscored the significance of few-shot learning. In this paper, we focus on two critical tasks in this context: few-shot multi-modal sarcasm detection (MSD) and multi-modal sentiment analysis (MSA). To address them, we propose Mixture-of-Prompt-Experts with Block-Aware Prompt Fusion (MoPE-BAF), a novel multi-modal soft prompt framework based on a unified vision-language model (VLM). Specifically, we design three soft prompt experts: a text prompt and an image prompt that extract modality-specific features to enrich the single-modal representations, and a unified prompt that assists multi-modal interaction. Additionally, we reorganize the Transformer layers into several blocks and introduce cross-modal prompt attention between adjacent blocks, which smooths the transition from single-modal representation to multi-modal fusion. On both MSD and MSA datasets in the few-shot setting, our proposed model not only surpasses the 8.2B-parameter InstructBLIP with merely 2% of its parameters (150M), but also significantly outperforms other widely used prompt methods on VLMs as well as task-specific methods.
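
The two ideas in the abstract (three soft prompt experts, and cross-modal prompt attention between adjacent blocks of Transformer layers) can be illustrated with a rough sketch. This is not the authors' implementation. The helper names (`split_into_blocks`, `cross_prompt_attention`, `init_prompts`), the 12-layer / 3-block split, the prompt length, the residual update, and the choice to exchange information only between the text and image prompts are all assumptions made for illustration; the abstract does not specify them.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def split_into_blocks(num_layers, num_blocks):
    """Partition layer indices into contiguous blocks of near-equal size."""
    base, extra = divmod(num_layers, num_blocks)
    blocks, start = [], 0
    for b in range(num_blocks):
        size = base + (1 if b < extra else 0)
        blocks.append(list(range(start, start + size)))
        start += size
    return blocks


def cross_prompt_attention(queries, keys_values):
    """Scaled dot-product attention from one modality's prompt vectors
    (queries) over the other modality's prompt vectors (keys and values).
    The residual addition is an assumption, not taken from the paper."""
    dim = len(queries[0])
    updated = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        weights = softmax(scores)
        mixed = [sum(w * kv[i] for w, kv in zip(weights, keys_values))
                 for i in range(dim)]
        updated.append([qi + mi for qi, mi in zip(q, mixed)])
    return updated


def init_prompts(rng, length, dim):
    """Randomly initialised soft prompt: `length` vectors of size `dim`."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(length)]


rng = random.Random(0)
DIM, PROMPT_LEN = 8, 4  # toy sizes, hypothetical

# Three prompt experts, as named in the abstract.
experts = {
    "text": init_prompts(rng, PROMPT_LEN, DIM),
    "image": init_prompts(rng, PROMPT_LEN, DIM),
    "unified": init_prompts(rng, PROMPT_LEN, DIM),
}

# Hypothetical backbone: 12 Transformer layers grouped into 3 blocks.
blocks = split_into_blocks(num_layers=12, num_blocks=3)

for idx, layer_ids in enumerate(blocks):
    # A real model would run the Transformer layers in `layer_ids` here,
    # with the prompts prepended to the token sequence (omitted).
    if idx < len(blocks) - 1:
        # Between adjacent blocks, let each modality's prompt attend to
        # the other's before entering the next block.
        new_text = cross_prompt_attention(experts["text"], experts["image"])
        new_image = cross_prompt_attention(experts["image"], experts["text"])
        experts["text"], experts["image"] = new_text, new_image
```

In this toy version the prompts are plain Python lists so the control flow stays visible; in practice they would be trainable tensors inside the VLM, and only the prompts (not the backbone) would typically be updated during few-shot training.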
