CV | Mar 22, 2024

CLIP-VQDiffusion: Language Free Training of Text To Image generation using CLIP and vector quantized diffusion model

arXiv:2403.14944v1 | 4 citations | h-index: 3 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of developing text-to-image models for datasets lacking text captions, such as face datasets, by enabling language-free training, though it is incremental as it builds on existing CLIP and diffusion techniques.

The authors address the problem of training text-to-image generation models without costly text-image paired datasets by using a pretrained CLIP model for multimodal representations. Their method, CLIP-VQDiffusion, outperformed previous state-of-the-art methods by 4.4% in CLIPScore on the FFHQ dataset and generated realistic images even with out-of-distribution text.
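The summary above does not spell out how a captionless dataset can still yield a text-conditioned generator. One common setup for language-free training, shown here only as a minimal sketch and not as the authors' confirmed method, relies on CLIP placing images and text in a shared embedding space: the generator is conditioned on the CLIP image embedding during training and on the CLIP text embedding at inference. The function names, the Gaussian perturbation, and the `noise_std` value are all assumptions made for illustration, and the vectors are toy stand-ins for real CLIP embeddings.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length, as CLIP embeddings are usually compared."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def training_condition(image_emb, noise_std=0.1, rng=random):
    """Training time: no caption exists, so condition on the image embedding.

    The Gaussian perturbation is an assumed trick to make the generator
    tolerant of the gap between image and text embeddings.
    """
    unit = l2_normalize(image_emb)
    noisy = [x + rng.gauss(0.0, noise_std) for x in unit]
    return l2_normalize(noisy)


def inference_condition(text_emb):
    """Inference time: swap in the text embedding of the user's prompt."""
    return l2_normalize(text_emb)


# Toy 2-dimensional vectors standing in for real CLIP embeddings.
train_cond = training_condition([3.0, 4.0], noise_std=0.0)
infer_cond = inference_condition([6.0, 8.0])
```

Because both conditions live on the same unit sphere, a generator trained on the first can, under this assumption, accept the second without ever having seen a caption.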

There has been significant progress in text-conditional image generation models. Recent advances in this field depend not only on improvements in model structures, but also on vast quantities of text-image paired data. However, creating such datasets is very costly and labor-intensive. Well-known face datasets do not have corresponding text captions, making it difficult to develop text-conditional image generation models on them. Some research has focused on developing text-to-image generation models using only images, without text captions. Here, we propose CLIP-VQDiffusion, which leverages the pretrained CLIP model to provide multimodal text-image representations and strong image generation capabilities. On the FFHQ dataset, our model outperformed previous state-of-the-art methods by 4.4% in CLIPScore and generated very realistic images for both in-distribution and out-of-distribution text. The pretrained models and code will soon be available at https://github.com/INFINIQ-AI1/CLIPVQDiffusion
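For readers unfamiliar with the metric behind the 4.4% figure, CLIPScore is commonly defined (Hessel et al., 2021) as a scaled, clipped cosine similarity between a CLIP image embedding and a CLIP text embedding, with scale factor w = 2.5. The sketch below assumes that definition; the abstract does not state which variant or CLIP backbone was used, and the vectors are toy stand-ins for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def clip_score(image_emb, text_emb, w=2.5):
    """CLIPScore under the assumed definition: w * max(cos(image, text), 0)."""
    return w * max(cosine_similarity(image_emb, text_emb), 0.0)


# Toy 3-dimensional vectors standing in for real CLIP embeddings.
aligned = clip_score([1.0, 0.0, 0.0], [1.0, 1.0, 0.0])
opposed = clip_score([1.0, 0.0, 0.0], [-1.0, 0.0, 0.0])  # negative cosine is clipped to 0
```

A higher score means the generated image sits closer to the prompt in CLIP's embedding space; it does not by itself measure image realism.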
