MMGA: Multimodal Learning with Graph Alignment
This work addresses the challenge of incorporating non-regular graph data into multimodal learning for social media applications, though it appears incremental as it builds on existing multimodal pre-training approaches.
The authors tackle the problem of integrating graph data with image and text modalities for user representation learning on social media. They propose MMGA, a multimodal pre-training framework that improves performance on a fans prediction task, evaluated on a dataset of 60,000 Instagram users.
Multimodal pre-training breaks down the barriers between modalities and allows individual modalities to augment one another with information, resulting in significant advances in representation learning. However, the graph modality, a very general and important form of data, does not interact easily with other modalities because of its non-regular structure. In this paper, we propose MMGA (Multimodal learning with Graph Alignment), a novel multimodal pre-training framework that incorporates information from the graph (social network), image, and text modalities on social media to enhance user representation learning. In MMGA, a multi-step graph alignment mechanism adds self-supervision from the graph modality to optimize the image and text encoders, while using information from the image and text modalities to guide the learning of the graph encoder. We conduct experiments on a dataset crawled from Instagram. The experimental results show that MMGA works well on this dataset and improves performance on the fans prediction task. To facilitate future research, we release our dataset, the first social media multimodal dataset that includes a graph modality, comprising 60,000 users labeled with specific topics based on 2 million posts.
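
The abstract does not spell out the alignment objective, so the following is only an illustrative sketch of how a graph embedding could be aligned with an image or text embedding of the same user. It assumes a symmetric InfoNCE-style contrastive loss, which is a common choice in multimodal pre-training but is not confirmed here as MMGA's actual multi-step mechanism. The function names (`info_nce`, `graph_alignment_loss`), the temperature value, and the toy embeddings are all hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def info_nce(anchors, targets, temperature=0.1):
    """Mean cross-entropy of matching anchors[i] to targets[i] among all targets.

    Assumption: row i of both lists belongs to the same user.
    """
    a = [l2_normalize(v) for v in anchors]
    t = [l2_normalize(v) for v in targets]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity to every target, scaled by the temperature.
        logits = [sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


def graph_alignment_loss(graph_emb, content_emb, temperature=0.1):
    """Symmetric loss: graph -> content and content -> graph (hypothetical)."""
    return 0.5 * (
        info_nce(graph_emb, content_emb, temperature)
        + info_nce(content_emb, graph_emb, temperature)
    )


# Toy data: three users with 2-d embeddings (made up for illustration).
graph_emb = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
shuffled = [aligned[1], aligned[2], aligned[0]]  # wrong user pairing

loss_aligned = graph_alignment_loss(graph_emb, aligned)
loss_shuffled = graph_alignment_loss(graph_emb, shuffled)
print(loss_aligned < loss_shuffled)
```

In this sketch the loss is lower when each user's graph embedding is paired with that same user's content embedding than when the pairing is shuffled. A gradient of such a loss would flow into both encoders, which is one plausible reading of the graph modality supervising the image and text encoders while they in turn guide the graph encoder.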