CV AI

Oct 1, 2023

Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants

Tianyu Yu, Jinyi Hu, Yuan Yao, Haoye Zhang, Yue Zhao, Chongyi Wang, Shan Wang, Yinxv Pan, Jiao Xue, Dahai Li, Zhiyuan Liu, Hai-Tao Zheng

Tsinghua

arXiv:2310.00653v1

14.527 citations

h-index: 98

Has Code

Originality: Incremental advance

AI Analysis

This work targets more efficient and effective universal multimodal assistants. The advance is incremental, since it builds on existing MLLM architectures and datasets.

The paper aims to improve Multimodal Large Language Models (MLLMs) with two contributions: Muffin, a framework that uses pre-trained vision-language models as bridges without additional alignment pre-training, and UniMM-Chat, a dataset of 1.1M high-quality multimodal instructions. The authors report state-of-the-art performance on vision-language tasks, surpassing models such as LLaVA and InstructBLIP.

Recent Multimodal Large Language Models (MLLMs) exhibit impressive abilities to perceive images and follow open-ended instructions. The capabilities of MLLMs depend on two crucial factors: the model architecture, which facilitates feature alignment between visual modules and large language models, and the multimodal instruction tuning datasets used for human instruction following. (i) For the model architecture, most existing models introduce an external bridge module to connect vision encoders with language models, which requires additional feature-alignment pre-training. In this work, we discover that compact pre-trained vision-language models can inherently serve as "out-of-the-box" bridges between vision and language. Based on this, we propose the Muffin framework, which directly employs pre-trained vision-language models as providers of visual signals. (ii) For the multimodal instruction tuning datasets, existing methods overlook the complementary relationships between different datasets and simply mix datasets from different tasks. Instead, we propose the UniMM-Chat dataset, which exploits the complementarities of datasets to generate 1.1M high-quality and diverse multimodal instructions. We merge information describing the same image from diverse datasets and transform it into more knowledge-intensive conversation data. Experimental results demonstrate the effectiveness of the Muffin framework and the UniMM-Chat dataset. Muffin achieves state-of-the-art performance on a wide range of vision-language tasks, significantly surpassing state-of-the-art models like LLaVA and InstructBLIP. Our model and dataset are both accessible at https://github.com/thunlp/muffin.
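
The merging step described for UniMM-Chat (collecting information about the same image from several datasets before turning it into conversation data) can be illustrated with a minimal sketch. This is not the authors' pipeline: the dataset names, the image IDs, the record layout, and the helper names `merge_by_image` and `build_context` are all hypothetical, and the abstract does not say how the merged information is converted into dialogue.

```python
from collections import defaultdict


def merge_by_image(sources):
    """Group annotations from several datasets by the image they describe.

    `sources` maps a (hypothetical) dataset name to a list of
    (image_id, annotation) pairs. Returns a plain dict of the form
    {image_id: {dataset_name: [annotation, ...]}}.
    """
    merged = defaultdict(lambda: defaultdict(list))
    for dataset_name, records in sources.items():
        for image_id, annotation in records:
            merged[image_id][dataset_name].append(annotation)
    # Convert nested defaultdicts to plain dicts so missing keys raise errors.
    return {image_id: dict(per_dataset) for image_id, per_dataset in merged.items()}


def build_context(image_id, merged):
    """Flatten everything known about one image into a single text block.

    A block like this could be handed to a text generator that writes a
    conversation; that downstream step is an assumption, not shown here.
    """
    lines = [f"Image: {image_id}"]
    for dataset_name in sorted(merged[image_id]):
        for annotation in merged[image_id][dataset_name]:
            lines.append(f"[{dataset_name}] {annotation}")
    return "\n".join(lines)


# Toy inputs; real datasets would supply many annotations per image.
sources = {
    "captions": [("img_001", "A dog catches a frisbee in a park.")],
    "vqa": [
        ("img_001", "Q: What color is the frisbee? A: Red."),
        ("img_002", "Q: How many people are there? A: Two."),
    ],
}

merged = merge_by_image(sources)
context = build_context("img_001", merged)
print(context)
```

Keying on a shared image identifier is what lets a caption and a question-answer pair about the same picture end up in one record, which is the complementarity the abstract refers to.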
