CV · AI · CL · LG · Jun 16, 2023

Energy-Based Cross Attention for Bayesian Context Update in Text-to-Image Diffusion Models

arXiv:2306.09869v3 · 35 citations · h-index: 38 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses inaccurate image generation from text prompts in diffusion models, though it appears incremental because it builds on existing cross-attention mechanisms.

The paper tackles semantic misalignment in text-to-image diffusion models by introducing an energy-based model framework for adaptive context control, and reports improved results on multi-concept generation, text-guided image inpainting, and real and synthetic image editing.

Despite the remarkable performance of text-to-image diffusion models in image generation tasks, recent studies have pointed out that generated images sometimes fail to capture the intended semantic content of the text prompts, a phenomenon often called semantic misalignment. To address this, we present a novel energy-based model (EBM) framework for adaptive context control that models the posterior of context vectors. Specifically, we first formulate EBMs of latent image representations and text embeddings in each cross-attention layer of the denoising autoencoder. We then obtain the gradient of the log posterior of the context vectors, with which the context vectors are updated and transferred to the subsequent cross-attention layer, thereby implicitly minimizing a nested hierarchy of energy functions. Our latent EBMs further allow zero-shot compositional generation as a linear combination of cross-attention outputs from different contexts. Through extensive experiments, we demonstrate that the proposed method is highly effective in handling various image generation tasks, including multi-concept generation, text-guided image inpainting, and real and synthetic image editing. Code: https://github.com/EnergyAttention/Energy-Based-CrossAttention.
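The two mechanisms in the abstract, a gradient step on the log posterior of the context and a linear combination of cross-attention outputs, can be illustrated with a small NumPy sketch. By Bayes' rule, ∇_K log p(K | Q) = ∇_K log p(Q | K) + ∇_K log p(K), because the evidence p(Q) does not depend on K; with densities of the form p ∝ exp(−E), ascending the log posterior means descending an energy. The energy used here is an assumption chosen for illustration and is not necessarily the one in the paper: a Hopfield-style log-sum-exp likelihood term plus a hypothetical Gaussian prior (α/2)·||K||², whose gradient works out to αK − AᵀQ, where A is the ordinary softmax cross-attention map. The sketch also simplifies by updating the keys K directly rather than the underlying text embeddings, and the names update_context, compose, alpha, beta, gamma and the user-chosen composition weights are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    z = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)


def attention_map(Q, K, beta=1.0):
    # A[i, j]: attention of query i (image token) on key j (context token).
    return softmax(beta * Q @ K.T, axis=-1)


def energy(K, Q, alpha=0.1, beta=1.0):
    # Assumed energy of keys K given queries Q, up to an additive constant:
    #   alpha/2 * ||K||_F^2                                 (hypothetical Gaussian prior on K)
    #   - sum_i (1/beta) * logsumexp_j(beta * q_i . k_j)    (Hopfield-style likelihood term)
    S = beta * Q @ K.T
    m = np.max(S, axis=-1, keepdims=True)
    lse = (m[:, 0] + np.log(np.sum(np.exp(S - m), axis=-1))) / beta
    return 0.5 * alpha * np.sum(K * K) - np.sum(lse)


def energy_grad(K, Q, alpha=0.1, beta=1.0):
    # Analytic gradient of energy() w.r.t. K: alpha * K - A^T Q.
    A = attention_map(Q, K, beta)
    return alpha * K - A.T @ Q


def update_context(K, Q, alpha=0.1, beta=1.0, gamma=0.05):
    # One gradient-descent step on the energy, i.e. one ascent step on log p(K | Q).
    return K - gamma * energy_grad(K, Q, alpha, beta)


def cross_attention(Q, K, V, beta=1.0):
    # Cross-attention output softmax(beta * Q K^T) V.
    return attention_map(Q, K, beta) @ V


def compose(Q, contexts, weights, beta=1.0):
    # Weighted sum of cross-attention outputs, one per context (K_c, V_c).
    return sum(w * cross_attention(Q, K, V, beta)
               for (K, V), w in zip(contexts, weights))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q = rng.normal(size=(16, 8))  # 16 image-token queries of width 8
    K = rng.normal(size=(4, 8))   # 4 context-token keys
    V = rng.normal(size=(4, 8))   # matching values
    K_new = update_context(K, Q)
    print(energy(K_new, Q) < energy(K, Q))
    out = compose(Q, [(K, V), (K_new, V)], weights=[0.5, 0.5])
    print(out.shape)
```

Because the log-sum-exp term is concave in K, the curvature of this assumed energy is bounded above by alpha, so any step size gamma < 2/alpha is guaranteed to lower the energy. In a full model, the updated context would then be passed to the next cross-attention layer, as the abstract describes.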

Code Implementations: 1 repo
