CV | Feb 8, 2022

MaskGIT: Masked Generative Image Transformer

arXiv:2202.04200v1 | 1185 citations
Originality: Highly original
AI Analysis

MaskGIT addresses slow image synthesis, offering computer vision researchers and practitioners a novel decoding paradigm with significant speed improvements.

The paper tackled the inefficiency of sequential image generation in transformers by proposing MaskGIT, a bidirectional transformer decoder that predicts masked tokens during training and refines images iteratively at inference. It outperformed state-of-the-art transformer models on ImageNet and accelerated decoding by up to 64x.
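The training-time corruption described above can be illustrated with a minimal sketch. It assumes an image has already been turned into a flat list of integer token ids; `MASK_ID`, `mask_tokens`, the 16-token grid, and the fixed 50% mask ratio are hypothetical choices made for illustration, not details taken from the paper.

```python
import random

MASK_ID = -1  # hypothetical id standing in for the special [MASK] token


def mask_tokens(tokens, mask_ratio, rng):
    """Replace a random subset of tokens with MASK_ID.

    Returns the corrupted sequence and the sorted masked positions, which
    are the positions a masked-token prediction loss would be computed on.
    """
    n = len(tokens)
    num_masked = max(1, round(mask_ratio * n))  # always mask at least one token
    positions = sorted(rng.sample(range(n), num_masked))
    corrupted = list(tokens)
    for p in positions:
        corrupted[p] = MASK_ID
    return corrupted, positions


rng = random.Random(0)
tokens = [rng.randrange(1024) for _ in range(16)]  # toy 4x4 grid of token ids
corrupted, positions = mask_tokens(tokens, mask_ratio=0.5, rng=rng)
```

Because the masked positions are scattered across the grid, a model trained this way has to use context on every side of a missing token, which is what makes a bidirectional decoder a natural fit.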

Generative transformers have experienced rapid popularity growth in the computer vision community in synthesizing high-fidelity and high-resolution images. The best generative transformer models so far, however, still treat an image naively as a sequence of tokens, and decode an image sequentially following the raster scan ordering (i.e. line-by-line). We find this strategy neither optimal nor efficient. This paper proposes a novel image synthesis paradigm using a bidirectional transformer decoder, which we term MaskGIT. During training, MaskGIT learns to predict randomly masked tokens by attending to tokens in all directions. At inference time, the model begins with generating all tokens of an image simultaneously, and then refines the image iteratively conditioned on the previous generation. Our experiments demonstrate that MaskGIT significantly outperforms the state-of-the-art transformer model on the ImageNet dataset, and accelerates autoregressive decoding by up to 64x. Besides, we illustrate that MaskGIT can be easily extended to various image editing tasks, such as inpainting, extrapolation, and image manipulation.
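The inference procedure in the abstract (predict every token at once, then refine over a few iterations) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: `toy_model` is a random stand-in for the bidirectional transformer, and the confidence-based re-masking with a cosine schedule is assumed here because the text above does not spell out how tokens are chosen for refinement. `MASK_ID`, `VOCAB_SIZE`, and `iterative_decode` are hypothetical names.

```python
import math
import random

MASK_ID = -1       # hypothetical id for the [MASK] token
VOCAB_SIZE = 1024  # hypothetical codebook size


def toy_model(tokens, rng):
    """Random stand-in for the transformer: one (token, confidence) guess per position."""
    return [(rng.randrange(VOCAB_SIZE), rng.random()) for _ in tokens]


def iterative_decode(num_tokens, num_steps, rng):
    """Fill a fully masked sequence in num_steps parallel prediction passes."""
    tokens = [MASK_ID] * num_tokens
    history = []  # number of tokens still masked after each step
    for step in range(1, num_steps + 1):
        preds = toy_model(tokens, rng)  # predicts all positions in one pass
        # Assumed cosine schedule: how many tokens remain masked after this step.
        keep_masked = math.floor(num_tokens * math.cos(math.pi / 2 * step / num_steps))
        if step == num_steps:
            keep_masked = 0  # the final pass commits everything
        masked = [i for i, t in enumerate(tokens) if t == MASK_ID]
        masked.sort(key=lambda i: preds[i][1])  # least confident first
        for i in masked[keep_masked:]:          # commit the most confident guesses
            tokens[i] = preds[i][0]
        history.append(sum(t == MASK_ID for t in tokens))
    return tokens, history


decoded, history = iterative_decode(num_tokens=16, num_steps=4, rng=random.Random(0))
print(history)  # [14, 11, 6, 0]
```

In this toy run the model is called 4 times to produce 16 tokens, whereas raster-scan autoregressive decoding would need one call per token. That gap between the number of passes and the number of tokens is the source of the speedup the abstract reports.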

Code Implementations: 9 repos

Data from Papers with Code (CC-BY-SA-4.0)

