CL · CV · May 24, 2022

mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections

Chenliang Li, Haiyang Xu, Junfeng Tian, Wei Wang, Ming Yan, Bin Bi, Jiabo Ye, Hehong Chen, Guohai Xu, Zheng Cao, Ji Zhang, Songfang Huang

arXiv:2205.12005v2 · 29.0 · 392 citations · h-index: 62 · Has Code

Originality: Incremental advance

AI Analysis

This work targets efficiency and performance issues in vision-language models for applications that need cross-modal understanding and generation. Its contribution is an incremental improvement to existing architectures.

The paper tackles low computational efficiency and information asymmetry in vision-language models, both attributed to the long visual sequence in cross-modal alignment. It introduces mPLUG, a model whose cross-modal skip-connections create shortcuts past a number of layers of costly full self-attention on the vision side, and it reports state-of-the-art results on tasks such as image captioning and visual question answering.

Large-scale pretrained foundation models have been an emerging paradigm for building artificial intelligence (AI) systems, which can be quickly adapted to a wide range of downstream tasks. This paper presents mPLUG, a new vision-language foundation model for both cross-modal understanding and generation. Most existing pre-trained models suffer from the problems of low computational efficiency and information asymmetry brought by the long visual sequence in cross-modal alignment. To address these problems, mPLUG introduces an effective and efficient vision-language architecture with novel cross-modal skip-connections, which creates inter-layer shortcuts that skip a certain number of layers for time-consuming full self-attention on the vision side. mPLUG is pre-trained end-to-end on large-scale image-text pairs with both discriminative and generative objectives. It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering. mPLUG also demonstrates strong zero-shot transferability when directly transferred to multiple video-language tasks.
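To make the efficiency argument concrete, here is a rough cost model, not the authors' implementation. It counts query-key score pairs per attention layer for two hypothetical layouts: one that runs full self-attention over the concatenated vision and text sequence in every layer, and one that skips vision-side self-attention for `skip` layers before each full fusion layer. The layer layout, the function names, and the token counts are all assumptions made for illustration; the abstract only states that some layers of full self-attention on the vision side are skipped.

```python
# Illustrative cost model only. It counts query-key score pairs per
# attention layer. The layer layout is an assumption, not the paper's code.


def full_fusion_pairs(n_vis: int, n_txt: int, n_layers: int) -> int:
    """Every layer runs self-attention over the concatenated sequence."""
    seq = n_vis + n_txt
    return n_layers * seq * seq


def skip_connected_pairs(n_vis: int, n_txt: int, n_layers: int, skip: int) -> int:
    """Blocks of `skip` text-side layers followed by one full fusion layer.

    Text-side layer (assumed): text self-attention plus text-to-vision
    cross-attention; visual tokens are carried forward unchanged.
    Fusion layer (assumed): full self-attention over vision + text.
    """
    if n_layers % (skip + 1) != 0:
        raise ValueError("n_layers must be a multiple of skip + 1")
    blocks = n_layers // (skip + 1)
    text_side = n_txt * n_txt + n_txt * n_vis  # text-text plus text-vision pairs
    fusion = (n_vis + n_txt) ** 2              # all pairs in the joint sequence
    return blocks * (skip * text_side + fusion)


if __name__ == "__main__":
    n_vis, n_txt = 576, 32  # hypothetical token counts
    full = full_fusion_pairs(n_vis, n_txt, n_layers=6)
    skipped = skip_connected_pairs(n_vis, n_txt, n_layers=6, skip=2)
    print(f"full fusion:    {full}")
    print(f"skip-connected: {skipped}")
    print(f"ratio:          {skipped / full:.2f}")
```

Because the visual sequence is usually much longer than the text, the quadratic term in the number of visual tokens dominates, so under these assumptions skipping it in most layers removes most of the attention cost. With `skip=0` the two functions return the same count.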
