CL · May 12, 2023

Towards Versatile and Efficient Visual Knowledge Integration into Pre-trained Language Models with Cross-Modal Adapters

Xinyun Zhang, Haochen Tan, Han Wu, Bei Yu

arXiv:2305.07358v4 · 1.32 citations

Originality: Incremental advance

AI Analysis

The paper addresses the challenge of enhancing PLMs with multi-modal information for better reasoning and understanding. It represents an incremental improvement over existing methods.

The paper tackles the problem of integrating visual knowledge into pre-trained language models (PLMs) to overcome their text-only limitations, proposing a plug-and-play module called X-adapter that significantly improves performance on object-color reasoning and natural language understanding tasks compared to PLM baselines.

Humans learn language via multi-modal knowledge. However, due to the text-only pre-training scheme, most existing pre-trained language models (PLMs) lack access to multi-modal information. To inject visual knowledge into PLMs, existing methods incorporate either the text or image encoder of vision-language models (VLMs) to encode the visual information and update all the original parameters of PLMs for knowledge fusion. In this paper, we propose a new plug-and-play module, X-adapter, to flexibly leverage the aligned visual and textual knowledge learned in pre-trained VLMs and efficiently inject it into PLMs. Specifically, we insert X-adapters into PLMs, and only the added parameters are updated during adaptation. To fully exploit the potential of VLMs, X-adapters consist of two sub-modules, V-expert and T-expert, to fuse VLMs' image and text representations, respectively. We can activate different sub-modules depending on the downstream task. Experimental results show that our method can significantly improve the performance on object-color reasoning and natural language understanding (NLU) tasks compared with PLM baselines.
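The adapter structure described in the abstract can be illustrated with a minimal, dependency-free sketch. The abstract does not specify the fusion mechanism, so everything below is an assumption: the sketch uses single-head cross-attention with a residual connection, and the names `Expert`, `XAdapter`, `value_init`, and the feature shapes are hypothetical rather than the authors' implementation. It shows two ideas only: each expert is active only when its VLM features are supplied, and the adapter's parameters are the only new ones (the PLM hidden states pass through unchanged otherwise).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


class Expert:
    """Hypothetical expert: PLM hidden states attend over one kind of VLM feature."""

    def __init__(self, d_model, d_vlm, value_init=0.0):
        # Assumed projections from VLM space (d_vlm) into PLM space (d_model).
        self.W_k = [[1.0 if j % d_model == i else 0.0 for j in range(d_vlm)]
                    for i in range(d_model)]
        # A zero value projection makes the expert a no-op at initialisation.
        self.W_v = [[value_init] * d_vlm for _ in range(d_model)]

    def __call__(self, hidden, feats):
        keys = [matvec(self.W_k, f) for f in feats]
        values = [matvec(self.W_v, f) for f in feats]
        scale = math.sqrt(len(hidden[0]))
        out = []
        for h in hidden:
            attn = softmax([sum(a * b for a, b in zip(h, k)) / scale for k in keys])
            out.append([sum(a * v[i] for a, v in zip(attn, values))
                        for i in range(len(h))])
        return out


class XAdapter:
    """Hypothetical adapter holding a V-expert (image) and a T-expert (text)."""

    def __init__(self, d_model, d_vlm, value_init=0.0):
        self.v_expert = Expert(d_model, d_vlm, value_init)
        self.t_expert = Expert(d_model, d_vlm, value_init)

    def __call__(self, hidden, image_feats=None, text_feats=None):
        out = [list(h) for h in hidden]  # residual path: start from PLM states
        for expert, feats in ((self.v_expert, image_feats),
                              (self.t_expert, text_feats)):
            if feats:  # an expert is active only when its features are given
                update = expert(hidden, feats)
                out = [[o + u for o, u in zip(row, upd)]
                       for row, upd in zip(out, update)]
        return out


# Toy inputs: two token states of width 2, two image features of width 3.
hidden = [[0.5, -1.0], [2.0, 0.0]]
image_feats = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]

untouched = XAdapter(d_model=2, d_vlm=3)(hidden)  # no expert active
fused = XAdapter(d_model=2, d_vlm=3, value_init=0.1)(hidden, image_feats=image_feats)
```

In this sketch, passing only `image_feats` activates the V-expert and passing only `text_feats` activates the T-expert, which mirrors the task-dependent choice the abstract mentions. In a real setting the PLM weights would be frozen and only the adapter's projections trained.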
