CV · Sep 27, 2023

Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack

Meta AI
arXiv:2309.15807v1 · 302 citations · h-index: 45
Originality: Incremental advance
AI Analysis

The paper addresses the need for aesthetic alignment in text-to-image generation, which matters to users who require high-quality outputs. The contribution is incremental, since it builds on existing pre-trained models.

The paper tackles the low aesthetic quality of images generated by text-to-image models. It proposes quality-tuning, which fine-tunes a pre-trained model on a small set of highly appealing images. The resulting model, Emu, achieves a win rate of 82.9% over its pre-trained counterpart. On visual appeal, it is preferred over SDXLv1.0 68.4% of the time on PartiPrompts and 71.3% of the time on the Open User Input benchmark.

Training text-to-image models with web scale image-text pairs enables the generation of a wide range of visual concepts from text. However, these pre-trained models often face challenges when it comes to generating highly aesthetic images. This creates the need for aesthetic alignment post pre-training. In this paper, we propose quality-tuning to effectively guide a pre-trained model to exclusively generate highly visually appealing images, while maintaining generality across visual concepts. Our key insight is that supervised fine-tuning with a set of surprisingly small but extremely visually appealing images can significantly improve the generation quality. We pre-train a latent diffusion model on $1.1$ billion image-text pairs and fine-tune it with only a few thousand carefully selected high-quality images. The resulting model, Emu, achieves a win rate of $82.9\%$ compared with its pre-trained only counterpart. Compared to the state-of-the-art SDXLv1.0, Emu is preferred $68.4\%$ and $71.3\%$ of the time on visual appeal on the standard PartiPrompts and our Open User Input benchmark based on the real-world usage of text-to-image models. In addition, we show that quality-tuning is a generic approach that is also effective for other architectures, including pixel diffusion and masked generative transformer models.
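The win rates quoted above come from pairwise preference comparisons between two models. As a minimal sketch of how such a figure can be computed, the snippet below takes one vote per prompt and reports the share of decided comparisons won by model A. The function name `win_rate`, the vote labels, and the choice to exclude ties from the denominator are assumptions made for illustration; the paper's exact evaluation protocol may differ, and the votes are toy values, not data from the paper.

```python
from collections import Counter


def win_rate(votes):
    """Fraction of decided comparisons won by model A.

    votes: iterable of "A", "B", or "tie", one entry per prompt.
    Ties are excluded from the denominator (an assumption of this sketch).
    """
    counts = Counter(votes)
    decided = counts["A"] + counts["B"]
    if decided == 0:
        raise ValueError("no decided comparisons")
    return counts["A"] / decided


# Hypothetical toy votes, not results from the paper.
votes = ["A", "A", "B", "tie", "A", "B", "A", "A"]
print(f"Model A win rate: {win_rate(votes):.1%}")
```

Counting ties as half a win for each side is another common convention; it changes the denominator to the total number of prompts, so the two conventions should not be mixed when comparing reported numbers.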

