CV
Dec 14, 2023

VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation

Tencent
arXiv:2312.09251v1
47 citations
h-index: 44
Originality: Highly original
AI Analysis

This work addresses the challenge of multimodal AI integration for applications in vision-language tasks, representing a novel method rather than an incremental improvement.

The authors tackled the problem of unified vision-language modeling by introducing VL-GPT, a transformer model that processes and generates both image and text data through a novel image tokenizer and auto-regressive pre-training, achieving strong zero-shot and few-shot performance on tasks like image captioning and visual question answering.

In this work, we introduce Vision-Language Generative Pre-trained Transformer (VL-GPT), a transformer model proficient at concurrently perceiving and generating visual and linguistic data. VL-GPT achieves a unified pre-training approach for both image and text modalities by employing a straightforward auto-regressive objective, thereby enabling the model to process image and text as seamlessly as a language model processes text. To accomplish this, we initially propose a novel image tokenizer-detokenizer framework for visual data, specifically designed to transform raw images into a sequence of continuous embeddings and reconstruct them accordingly. In combination with the existing text tokenizer and detokenizer, this framework allows for the encoding of interleaved image-text data into a multimodal sequence, which can subsequently be fed into the transformer model. Consequently, VL-GPT can perform large-scale pre-training on multimodal corpora utilizing a unified auto-regressive objective (i.e., next-token prediction). Upon completion of pre-training, VL-GPT exhibits remarkable zero-shot and few-shot performance across a diverse range of vision and language understanding and generation tasks, including image captioning, visual question answering, text-to-image generation, and more. Additionally, the pre-trained model retains in-context learning capabilities when provided with multimodal prompts. We further conduct instruction tuning on our VL-GPT, highlighting its exceptional potential for multimodal assistance. The source code and model weights shall be released.
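
The interleaving step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `[IMG]` and `[/IMG]` marker strings, the whitespace text tokenizer, and `toy_image_tokenizer` (which merely chunks a flat list of numbers into fixed-size vectors) are all hypothetical stand-ins chosen for illustration. The sketch only shows how text tokens and continuous image embeddings might share one sequence, and how next-token prediction pairs each element with its successor.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple, Union

# Hypothetical boundary markers; the abstract does not name any.
BOI, EOI = "[IMG]", "[/IMG]"


@dataclass
class TextToken:
    value: str


@dataclass
class ImageEmbedding:
    vector: Tuple[float, ...]


Element = Union[TextToken, ImageEmbedding]


def toy_image_tokenizer(pixels: Sequence[float], dim: int) -> List[ImageEmbedding]:
    """Stand-in for a learned image tokenizer (assumption).

    Splits a flat list of numbers into consecutive vectors of length `dim`.
    """
    if dim <= 0 or len(pixels) % dim != 0:
        raise ValueError("pixel count must be a positive multiple of dim")
    return [
        ImageEmbedding(tuple(pixels[i:i + dim]))
        for i in range(0, len(pixels), dim)
    ]


def build_sequence(segments: Sequence[Union[str, Sequence[float]]], dim: int) -> List[Element]:
    """Encode interleaved text and image segments into one multimodal sequence."""
    sequence: List[Element] = []
    for segment in segments:
        if isinstance(segment, str):
            # Whitespace splitting stands in for a real text tokenizer.
            sequence.extend(TextToken(word) for word in segment.split())
        else:
            sequence.append(TextToken(BOI))
            sequence.extend(toy_image_tokenizer(segment, dim))
            sequence.append(TextToken(EOI))
    return sequence


def next_token_pairs(sequence: Sequence[Element]) -> List[Tuple[Element, Element]]:
    """Pair each element with its successor, i.e. the next-token target."""
    return list(zip(sequence[:-1], sequence[1:]))


sequence = build_sequence(["a photo of", [0.1, 0.2, 0.3, 0.4], "a cat"], dim=2)
pairs = next_token_pairs(sequence)
for current, target in pairs:
    print(current, "->", target)
```

In a real model the target type would determine the loss, for example a classification loss when the next element is a text token and a regression loss when it is a continuous embedding. That split is an assumption of this sketch and is not stated in the abstract.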

Code Implementations: 1 repo
