CV · Sep 30, 2024

Magnet: We Never Know How Text-to-Image Diffusion Models Work, Until We Learn How Vision-Language Models Function

arXiv:2409.19967v117.823 citations · h-index: 2 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses a key limitation of text-to-image generation for users who need precise control over complex scenes, though the advance is incremental because it builds on existing models.

The paper tackles the problem of attribute binding in text-to-image diffusion models, where synthesis quality deteriorates with complex prompts, and proposes Magnet, a training-free method that significantly improves synthesis quality and binding accuracy with negligible computational cost.

Text-to-image diffusion models, particularly Stable Diffusion, have revolutionized the field of computer vision. However, synthesis quality often deteriorates when they are asked to generate images that faithfully represent complex prompts involving multiple attributes and objects. While previous studies suggest that blended text embeddings lead to improper attribute binding, few have explored this in depth. In this work, we critically examine the limitations of the CLIP text encoder in understanding attributes and investigate how this affects diffusion models. We discern a phenomenon of attribute bias in the text space and highlight a contextual issue in padding embeddings that entangle different concepts. We propose Magnet, a novel training-free approach to tackle the attribute binding problem. We introduce positive and negative binding vectors to enhance disentanglement, together with a neighbor strategy to increase accuracy. Extensive experiments show that Magnet significantly improves synthesis quality and binding accuracy with negligible computational cost, enabling the generation of unconventional and unnatural concepts.
