CV LG, May 26, 2021

CogView: Mastering Text-to-Image Generation via Transformers

Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, Jie Tang

arXiv:2105.13290v346.3989 citations, Has Code

Originality: Incremental advance

AI Analysis

This work addresses the open problem of general-domain text-to-image generation, which matters for applications in creative and design fields. The contribution appears incremental, since it builds on existing Transformer and VQ-VAE methods.

The authors propose CogView, a 4-billion-parameter Transformer with a VQ-VAE tokenizer, which achieves state-of-the-art FID on the blurred MS COCO dataset and outperforms previous GAN-based models and DALL-E.

Text-to-image generation in the general domain has long been an open problem, one that requires both a powerful generative model and cross-modal understanding. We propose CogView, a 4-billion-parameter Transformer with a VQ-VAE tokenizer, to advance this problem. We also demonstrate finetuning strategies for various downstream tasks, e.g., style learning, super-resolution, text-image ranking and fashion design, and methods to stabilize pretraining, e.g., eliminating NaN losses. CogView achieves state-of-the-art FID on the blurred MS COCO dataset, outperforming previous GAN-based models and a recent similar work, DALL-E.
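
The abstract mentions a VQ-VAE tokenizer feeding a Transformer. As a rough illustration only, the sketch below shows the general idea of vector quantization: each image feature vector is replaced by the index of its nearest codebook entry, and the resulting image tokens are appended to the text tokens to form one sequence. Everything here is an assumption for illustration: the tiny codebook, the hand-written feature vectors standing in for learned encoder outputs, the `quantize` and `build_sequence` helpers, and the hypothetical `boi_id` separator token. It is not CogView's actual implementation.

```python
from typing import List, Sequence


def quantize(features: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook vector.

    Nearest is measured by squared Euclidean distance. In a real VQ-VAE the
    features come from a learned encoder; here they are plain lists.
    """
    tokens = []
    for feature in features:
        distances = [
            sum((f - c) ** 2 for f, c in zip(feature, code))
            for code in codebook
        ]
        tokens.append(distances.index(min(distances)))
    return tokens


def build_sequence(text_tokens: Sequence[int],
                   image_tokens: Sequence[int],
                   boi_id: int) -> List[int]:
    """Concatenate text tokens, a hypothetical begin-of-image id, and image tokens."""
    return list(text_tokens) + [boi_id] + list(image_tokens)


# Toy codebook with three 2-D entries (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
# Toy "encoder outputs" for three image positions.
features = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.8]]

image_tokens = quantize(features, codebook)
sequence = build_sequence([7, 8, 9], image_tokens, boi_id=100)
print(image_tokens)
print(sequence)
```

An autoregressive Transformer trained on such sequences can then predict image tokens conditioned on the preceding text tokens; a decoder maps the predicted tokens back to pixels.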
