Transfer Learning with Joint Fine-Tuning for Multimodal Sentiment Analysis
This work addresses the high computational cost of multimodal sentiment analysis for social media applications and offers a more flexible and efficient solution, although the contribution is incremental.
The paper tackles the computational expense of pre-training and fine-tuning multimodal models for sentiment analysis by introducing a transfer learning approach with joint fine-tuning, achieving competitive results using a simpler strategy that combines pre-trained unimodal models efficiently.
Most existing methods focus on sentiment analysis of textual data. However, images and videos have recently become widespread on social platforms, motivating sentiment analysis from other modalities. Current studies show that exploring other modalities (e.g., images) improves sentiment analysis performance. State-of-the-art multimodal models, such as CLIP and VisualBERT, are pre-trained on datasets of text paired with images. Although the results obtained by these models are promising, pre-training them and fine-tuning them for sentiment analysis are both computationally expensive. This paper introduces a transfer learning approach using joint fine-tuning for sentiment analysis. Our proposal achieved competitive results using a more straightforward alternative fine-tuning strategy that leverages different pre-trained unimodal models and efficiently combines them in a multimodal space. Moreover, our proposal allows any pre-trained model for texts and images to be incorporated during the joint fine-tuning stage, which makes it especially attractive for sentiment classification in low-resource scenarios.
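The abstract does not specify how the unimodal representations are combined, so the following is only an illustrative sketch of one common option: projecting a text embedding and an image embedding into a shared space, concatenating them, and classifying the result. All names (`FusionClassifier`, `text_proj`, `image_proj`, `shared_dim`), the dimensions, and the concatenation-based fusion are assumptions made for illustration, not the authors' architecture. The input vectors stand in for the outputs of any pre-trained text and image encoders. Only the forward pass is shown; in joint fine-tuning, the classification loss would additionally be backpropagated through the head, both projections, and (optionally) both encoders.

```python
import math
import random


def linear(x, weights, bias):
    """Affine map: one output value per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_linear(n_in, n_out, rng):
    """Small random weights and zero bias for an n_in -> n_out layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class FusionClassifier:
    """Hypothetical fusion head over two unimodal embeddings (assumed design)."""

    def __init__(self, text_dim, image_dim, shared_dim, num_classes, seed=0):
        rng = random.Random(seed)
        # One projection per modality, so encoders with different output
        # sizes can be swapped in without changing the classifier head.
        self.text_proj = init_linear(text_dim, shared_dim, rng)
        self.image_proj = init_linear(image_dim, shared_dim, rng)
        self.head = init_linear(2 * shared_dim, num_classes, rng)

    def forward(self, text_emb, image_emb):
        # Project each modality into the shared space.
        t = [math.tanh(v) for v in linear(text_emb, *self.text_proj)]
        i = [math.tanh(v) for v in linear(image_emb, *self.image_proj)]
        # Fuse by concatenation (an assumption), then classify.
        fused = t + i
        return softmax(linear(fused, *self.head))


# Placeholder embeddings standing in for pre-trained encoder outputs.
model = FusionClassifier(text_dim=8, image_dim=6, shared_dim=4, num_classes=3)
text_emb = [0.1] * 8
image_emb = [0.2] * 6
probs = model.forward(text_emb, image_emb)
```

Because each modality has its own projection, replacing one encoder only changes `text_dim` or `image_dim`, which is one way the flexibility described in the abstract could be realized.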