CV AI Sep 5, 2023

Exchanging-based Multimodal Fusion with Transformer

Renyu Zhu, Chengcheng Han, Yong Qian, Qiushi Sun, Xiang Li, Ming Gao, Xuezhi Cao, Yunsen Xian

Stanford

arXiv:2309.02190v13.96 citations h-index: 14 Has Code

Originality Incremental advance

AI Analysis

This work addresses multimodal fusion for text-vision applications. It offers a new method that the authors report outperforms existing approaches, though the advance is incremental.

The authors tackle multimodal fusion for text-vision tasks by proposing MuSE, a Transformer-based model. It uses two encoders and two decoders to regularize the embeddings of each modality, then uses CrossTransformer to exchange knowledge between modalities. The authors report superior performance on the Multimodal Named Entity Recognition and Multimodal Sentiment Analysis tasks.

We study the problem of multimodal fusion in this paper. Recent exchanging-based methods have been proposed for vision-vision fusion; they aim to exchange embeddings learned from one modality to the other. However, most of them project inputs from different modalities into different low-dimensional spaces, and they cannot be applied to sequential input data. To solve these issues, we propose MuSE, a novel exchanging-based multimodal fusion model for text-vision fusion based on Transformer. We first use two encoders to separately map the multimodal inputs into different low-dimensional spaces. Then we employ two decoders to regularize the embeddings and pull them into the same space. The two decoders capture the correlations between texts and images through the image captioning task and the text-to-image generation task, respectively. Further, based on the regularized embeddings, we present CrossTransformer, which uses two Transformer encoders with shared parameters as the backbone model to exchange knowledge between modalities. Specifically, CrossTransformer first learns the global contextual information of the inputs in its shallow layers. After that, it performs inter-modal exchange by selecting a proportion of tokens in one modality and replacing their embeddings with the average of the embeddings in the other modality. We conduct extensive experiments to evaluate the performance of MuSE on the Multimodal Named Entity Recognition task and the Multimodal Sentiment Analysis task. Our results show the superiority of MuSE over other competitors. Our code and data are provided at https://github.com/RecklessRonan/MuSE.
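
The inter-modal exchange step described in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation (see the linked repository for that). The function names `mean_embedding` and `exchange` are hypothetical, and the rule used here to pick which tokens are exchanged (the lowest values of a caller-supplied per-token score) is an assumption, since the abstract only says that "a proportion of tokens" is selected.

```python
from typing import List

Vector = List[float]


def mean_embedding(tokens: List[Vector]) -> Vector:
    """Element-wise average of a non-empty list of equal-length vectors."""
    dim = len(tokens[0])
    return [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]


def exchange(target: List[Vector], source: List[Vector],
             scores: List[float], proportion: float) -> List[Vector]:
    """Return a copy of `target` with a proportion of its tokens replaced
    by the average embedding of `source` (the other modality).

    ASSUMPTION: tokens with the lowest `scores` are the ones replaced;
    the abstract does not state the selection criterion.
    """
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must be in [0, 1]")
    k = int(len(target) * proportion)  # number of tokens to exchange
    order = sorted(range(len(target)), key=lambda i: scores[i])
    chosen = set(order[:k])
    replacement = mean_embedding(source)
    return [list(replacement) if i in chosen else list(tok)
            for i, tok in enumerate(target)]


# Toy 2-dimensional embeddings, assumed to already lie in a shared space.
text = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 1.0]]
image = [[4.0, 4.0], [6.0, 8.0]]
text_scores = [0.9, 0.1, 0.5, 0.2]   # hypothetical per-token scores
image_scores = [0.3, 0.7]            # hypothetical per-token scores

# Both directions read the pre-exchange embeddings of the other modality.
new_text = exchange(text, image, text_scores, proportion=0.5)
new_image = exchange(image, text, image_scores, proportion=0.5)
```

With `proportion=0.5`, the two lowest-scoring text tokens take the image average `[5.0, 6.0]`, and the lowest-scoring image token takes the text average `[1.5, 1.0]`. The sketch assumes both modalities already share one embedding space, which is the role the abstract assigns to the two regularizing decoders.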
