CV | May 21, 2024

An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation

arXiv:2405.12914v2 | 10 citations | h-index: 14 | ECCV
Originality: Incremental advance
AI Analysis

This paper addresses the limitation of existing text-to-image models in handling non-English and long prompts. The advance is incremental: it builds on prior methods with an adapter-based approach.

The paper tackles the problem of improving text understanding in text-to-image generation by replacing the CLIP text encoder with Large Language Models (LLMs). The resulting model supports multilingual and longer input contexts with superior image quality.

One critical prerequisite for faithful text-to-image generation is the accurate understanding of text inputs. Existing methods leverage the text encoder of the CLIP model to represent input prompts. However, the pre-trained CLIP model can encode only English, with a maximum token length of 77. Moreover, the model capacity of CLIP's text encoder is relatively limited compared to Large Language Models (LLMs), which offer multilingual input, accommodate longer context, and achieve superior text representation. In this paper, we investigate LLMs as the text encoder to improve language understanding in text-to-image generation. Unfortunately, training a text-to-image generative model with LLMs from scratch demands significant computational resources and data. To this end, we introduce a three-stage training pipeline that effectively and efficiently integrates the existing text-to-image model with LLMs. Specifically, we propose a lightweight adapter that enables fast training of the text-to-image model using the textual representations from LLMs. Extensive experiments demonstrate that our model supports not only multilingual input but also longer input context, with superior image generation quality.
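To make the adapter idea concrete, here is a minimal sketch of what "adapting" LLM text representations for an existing text-to-image model could look like. The abstract does not describe the adapter's architecture, so this assumes the simplest possible form: a single per-token linear projection from the LLM hidden size to the conditioning size the image model expects. The names `make_linear`, `adapt`, `D_LLM`, and `D_COND`, along with the toy dimensions, are hypothetical and not taken from the paper; the LLM and the image model themselves are not modeled.

```python
import random

# Hypothetical toy sizes; real LLM and diffusion-model widths are far larger.
D_LLM = 16        # assumed hidden size of the LLM's per-token output
D_COND = 8        # assumed conditioning size expected by the image model
NUM_TOKENS = 100  # deliberately longer than CLIP's 77-token limit


def make_linear(d_in, d_out, seed=0):
    """Create a randomly initialised weight matrix (d_out x d_in) and zero bias."""
    rng = random.Random(seed)
    scale = 1.0 / (d_in ** 0.5)
    weight = [[rng.uniform(-scale, scale) for _ in range(d_in)]
              for _ in range(d_out)]
    bias = [0.0] * d_out
    return weight, bias


def adapt(llm_states, weight, bias):
    """Project every token vector from length d_in to length d_out.

    This is an assumed stand-in for the paper's lightweight adapter:
    only these parameters would be trained, while the LLM and the
    image model could stay frozen.
    """
    return [
        [sum(w * x for w, x in zip(row, token)) + b
         for row, b in zip(weight, bias)]
        for token in llm_states
    ]


# Stand-in for LLM output: one random vector per prompt token.
rng = random.Random(1)
llm_states = [[rng.gauss(0.0, 1.0) for _ in range(D_LLM)]
              for _ in range(NUM_TOKENS)]

weight, bias = make_linear(D_LLM, D_COND)
cond = adapt(llm_states, weight, bias)

print(len(cond), len(cond[0]))  # prints "100 8"
```

The point of the sketch is the shape change: the sequence length is preserved (so prompts longer than 77 tokens pass through intact), and only the feature dimension is remapped to what the image model's conditioning input expects.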

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
