CL, Feb 9, 2021

Learning Modality-Specific Representations with Self-Supervised Multi-Task Learning for Multimodal Sentiment Analysis

arXiv:2102.04830v1, 760 citations, Has Code
AI Analysis

This work provides a method for improving multimodal sentiment analysis by generating unimodal supervisions without additional annotation costs, which is beneficial for researchers and practitioners working with multimodal data.

This paper addresses the challenge of learning effective modality representations for multimodal sentiment analysis by proposing a self-supervised label generation module to create independent unimodal supervisions. By jointly training multimodal and unimodal tasks, the method learns both consistent and differentiated information across modalities, achieving state-of-the-art performance on the MOSI and MOSEI datasets and comparable performance to human-annotated unimodal labels on the SIMS dataset.

Representation learning is a significant and challenging task in multimodal learning. Effective modality representations should capture two kinds of characteristics: consistency and difference. Because of the unified multimodal annotation, existing methods are restricted in capturing differentiated information. However, additional unimodal annotations are costly in time and labor. In this paper, we design a label generation module based on a self-supervised learning strategy to acquire independent unimodal supervisions. We then jointly train the multimodal and unimodal tasks to learn the consistency and the difference, respectively. Moreover, during the training stage, we design a weight-adjustment strategy to balance the learning progress among the subtasks; that is, it guides the subtasks to focus on samples with a larger difference between modality supervisions. Finally, we conduct extensive experiments on three public multimodal baseline datasets. The experimental results validate the reliability and stability of the auto-generated unimodal supervisions. On the MOSI and MOSEI datasets, our method surpasses the current state-of-the-art methods. On the SIMS dataset, our method achieves performance comparable to that obtained with human-annotated unimodal labels. The full code is available at https://github.com/thuiar/Self-MM.
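
The joint training objective and the weight-adjustment idea described above can be illustrated with a small sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: the function names `sample_weight` and `joint_loss` are hypothetical, the weight is taken to be the tanh of the gap between a sample's unimodal and multimodal supervisions, and both tasks use an L1 loss. It is not taken from the Self-MM repository.

```python
import math


def sample_weight(unimodal_label: float, multimodal_label: float) -> float:
    # Assumed weighting (not stated in the abstract): the weight grows with
    # the gap between the unimodal and multimodal supervisions and stays in [0, 1).
    return math.tanh(abs(unimodal_label - multimodal_label))


def joint_loss(pred_m, label_m, preds_u, labels_u):
    """Average per-sample loss: multimodal L1 plus weighted unimodal L1 terms.

    pred_m, label_m: lists of floats, one entry per sample.
    preds_u, labels_u: dicts mapping a modality name (e.g. "text") to
    per-sample lists of predictions and generated unimodal labels.
    """
    n = len(label_m)
    total = 0.0
    for i in range(n):
        # Multimodal task: learns information shared across modalities.
        total += abs(pred_m[i] - label_m[i])
        # Unimodal subtasks: emphasised where their supervision differs most
        # from the multimodal supervision.
        for modality in preds_u:
            w = sample_weight(labels_u[modality][i], label_m[i])
            total += w * abs(preds_u[modality][i] - labels_u[modality][i])
    return total / n
```

Under this assumed weighting, a sample whose generated unimodal label equals the multimodal label contributes nothing to the unimodal term, so the subtasks concentrate on samples where the modalities disagree.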

