CV, LG | Mar 28, 2023

Variational Distribution Learning for Unsupervised Text-to-Image Generation

arXiv:2303.16105v1 | 4 citations | h-index: 57
Originality: Incremental advance
AI Analysis

This work addresses text-to-image generation when no text captions are available for the training images. The advance is incremental because it builds on CLIP and established variational methods.

The paper tackles text-to-image generation when no paired captions are available during training. It uses a pretrained CLIP model and variational inference to align image and text embeddings, and reports outperforming existing approaches by large margins in both unsupervised and semi-supervised settings.

We propose a text-to-image generation algorithm based on deep neural networks when text captions for images are unavailable during training. In this work, instead of simply generating pseudo-ground-truth sentences of training images using existing image captioning methods, we employ a pretrained CLIP model, which is capable of properly aligning embeddings of images and corresponding texts in a joint space and, consequently, works well on zero-shot recognition tasks. We optimize a text-to-image generation model by maximizing the data log-likelihood conditioned on pairs of image-text CLIP embeddings. To better align data in the two domains, we employ a principled way based on a variational inference, which efficiently estimates an approximate posterior of the hidden text embedding given an image and its CLIP feature. Experimental results validate that the proposed framework outperforms existing approaches by large margins under unsupervised and semi-supervised text-to-image generation settings.
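
The variational step described in the abstract can be read against the standard conditional evidence lower bound. The sketch below is the generic textbook bound, not the paper's exact objective. The symbols are assumptions introduced here for illustration: x is an image, c its CLIP image embedding, z the hidden text embedding, q_phi the approximate posterior, and p(z | c) an assumed prior over text embeddings.

```latex
\log p_\theta(x \mid c)
  \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x, c)}\!\left[ \log p_\theta(x \mid z, c) \right]
  \;-\;
  \mathrm{KL}\!\left( q_\phi(z \mid x, c) \,\|\, p(z \mid c) \right)
```

Under this reading, maximizing the right-hand side trains the generator p_theta and the posterior q_phi jointly. The first term rewards text embeddings that explain the image, and the KL term keeps the inferred embedding close to the assumed prior. How the paper parameterizes these distributions is not stated in the abstract.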

