CV CL LG · Jun 15, 2022

Write and Paint: Generative Vision-Language Models are Unified Modal Learners

Shizhe Diao, Wangchunshu Zhou, Xinsong Zhang, Jiawei Wang

arXiv:2206.07699v3 · 14.519 citations · h-index: 26 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for versatile multi-modal foundation models in AI, though it appears incremental because it builds on existing vision-language pre-training methods.

The paper tackles the problem of learning both image-to-text and text-to-image generation together in a unified model, proposing DaVinci, which achieves competitive performance on 27 generation and understanding tasks.

Recent advances in vision-language pre-training have pushed the state of the art on various vision-language tasks, making machines more capable of multi-modal writing (image-to-text generation) and painting (text-to-image generation). However, few studies investigate whether these two essential capabilities can be learned together and boost each other, yielding a versatile and powerful multi-modal foundation model. In this work, we reveal the potential of symmetric generative vision-language pre-training in learning to write and paint concurrently, and propose a new unified modal model, named DaVinci, trained with prefix language modeling and prefix image modeling, a simple generative self-supervised objective on image-text pairs. Thanks to the proposed prefix multi-modal modeling framework, DaVinci is simple to train, scalable to huge data, adaptable to both writing and painting tasks, and also strong on other vision, text, and multi-modal understanding tasks. DaVinci achieves competitive performance on a wide range of 27 generation/understanding tasks and demonstrates the superiority of combining vision/language generative pre-training. Furthermore, we carefully benchmark the performance of different vision-language pre-training objectives on pre-training datasets of different scales with heterogeneous and broad distribution coverage. Our results demonstrate the potential of exploiting self-supervision in both language and vision inputs, and establish new, stronger baselines for future comparisons at different data scales. The code and pre-trained models are available at https://github.com/shizhediao/DaVinci.
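The abstract names prefix language modeling and prefix image modeling but does not spell out the mechanics. The sketch below illustrates the general prefix-modeling idea only: a sequence is cut at a random point, the prefix is given as context, and the suffix becomes the generation target. Everything here is an assumption for illustration, not the released DaVinci code: the function names `prefix_split` and `make_examples`, the use of plain Python lists, and the premise that the image has already been converted into a sequence of discrete token ids.

```python
import random


def prefix_split(tokens, rng):
    """Cut a sequence at a random point so prefix and suffix are both non-empty."""
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to form a prefix and a suffix")
    # randint is inclusive on both ends, so cut is in [1, len(tokens) - 1].
    cut = rng.randint(1, len(tokens) - 1)
    return tokens[:cut], tokens[cut:]


def make_examples(text_tokens, image_tokens, rng):
    """Build one 'write' and one 'paint' example from an image-text pair.

    Hypothetical layout (an assumption, not taken from the paper):
      write: context = full image + text prefix,  target = text suffix
      paint: context = full text  + image prefix, target = image suffix
    """
    text_prefix, text_suffix = prefix_split(text_tokens, rng)
    image_prefix, image_suffix = prefix_split(image_tokens, rng)
    return {
        "write": {"context": image_tokens + text_prefix, "target": text_suffix},
        "paint": {"context": text_tokens + image_prefix, "target": image_suffix},
    }


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so the cut points are reproducible
    pair = make_examples(["a", "dog", "runs"], [11, 42, 7, 3], rng)
    for task, example in pair.items():
        print(task, example)
```

In a real model the context would typically be encoded with bidirectional attention and the target decoded autoregressively; the sketch only shows how a single image-text pair can yield training signal in both directions.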

