CV · CL · LG · Mar 21, 2023

MAGVLT: Masked Generative Vision-and-Language Transformer

arXiv:2303.12208v1 · 18 citations · h-index: 19
Originality: Incremental advance
AI Analysis

This paper addresses the need for efficient, versatile multimodal generative models. The advance is incremental, since it builds on existing transformer and masked-prediction techniques.

The paper tackles the problem of generating both images and text with a single unified model. It proposes MAGVLT, a non-autoregressive masked generative vision-and-language transformer that outperforms autoregressive baselines by a large margin while decoding significantly faster, and that achieves competitive zero-shot generation results on MS-COCO with fewer than 500M parameters.

While generative modeling on multimodal image-text data has been actively developed with large-scale paired datasets, there have been limited attempts to generate both image and text data with a single model, rather than generating one fixed modality conditioned on the other. In this paper, we explore a unified generative vision-and-language (VL) model that can produce both images and text sequences. Specifically, we propose a generative VL transformer based on non-autoregressive mask prediction, named MAGVLT, and compare it with an autoregressive generative VL transformer (ARGVLT). In comparison to ARGVLT, the proposed MAGVLT enables bidirectional context encoding, fast decoding by parallel token predictions in an iterative refinement, and extended editing capabilities such as image and text infilling. For rigorous training of our MAGVLT with image-text pairs from scratch, we combine the image-to-text, text-to-image, and joint image-and-text mask prediction tasks. Moreover, we devise two additional tasks based on step-unrolled mask prediction and selective prediction on the mixture of two image-text pairs. Experimental results on various downstream generation tasks of VL benchmarks show that our MAGVLT outperforms ARGVLT by a large margin even with significant inference speedup. In particular, MAGVLT achieves competitive results on both zero-shot image-to-text and text-to-image generation tasks from MS-COCO with one moderate-sized model (fewer than 500M parameters), even without the use of monomodal data and networks.
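
To make "parallel token predictions in an iterative refinement" concrete, here is a minimal sketch of mask-predict decoding. It is not the paper's implementation: `toy_predictor` is a hypothetical stand-in for the bidirectional transformer, the `MASK` id is arbitrary, and the cosine unmasking schedule is an assumption borrowed from common masked-generation practice rather than something the abstract specifies.

```python
import math
import random

MASK = -1  # hypothetical id for the [MASK] token


def toy_predictor(tokens, vocab_size, rng):
    """Hypothetical stand-in for a bidirectional transformer.

    Returns a (token, confidence) guess for every position in one pass.
    A real model would derive both from its softmax outputs.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def mask_predict_decode(tokens, steps=4, vocab_size=16, seed=0):
    """Fill every MASK slot in `tokens` within `steps` parallel rounds."""
    rng = random.Random(seed)
    tokens = list(tokens)
    total_masked = sum(t == MASK for t in tokens)
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # One forward pass predicts all positions in parallel.
        guesses = toy_predictor(tokens, vocab_size, rng)
        # Cosine schedule (an assumption): slots left masked after this round.
        keep_masked = int(total_masked * math.cos(math.pi / 2 * step / steps))
        n_commit = len(masked) - keep_masked
        # Commit the most confident guesses; the rest stay masked and are
        # predicted again in the next round with more context available.
        by_confidence = sorted(masked, key=lambda i: guesses[i][1], reverse=True)
        for i in by_confidence[:n_commit]:
            tokens[i] = guesses[i][0]
    return tokens


# Generation from scratch: every position starts masked.
generated = mask_predict_decode([MASK] * 8, steps=4)

# Infilling: known tokens are kept, only the masked gap is filled.
infilled = mask_predict_decode([3, MASK, MASK, 7], steps=2)
```

Because only masked positions are ever written, the same loop covers both full generation and infilling, which is one way to read the editing capabilities the abstract mentions. The number of forward passes is fixed by `steps` instead of growing with sequence length, which is where the speedup over autoregressive decoding would come from.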

Code Implementations: 1 repo
