CV | AI | Dec 10, 2023

Separate-and-Enhance: Compositional Finetuning for Text2Image Diffusion Models

arXiv:2312.06712v2 | 23 citations | SIGGRAPH
Originality: Incremental advance
AI Analysis

This work addresses a key limitation in text-to-image generation for users who need accurate multi-object scenes. It is incremental, however, because it builds on existing attention-based and mask-based approaches.

The paper tackles poor compositional generation in text-to-image diffusion models, particularly for multi-object scenes. It proposes a finetuning method with two novel objectives: one reduces object mask overlaps and the other maximizes attention scores. The authors report superior image realism and text-image alignment compared to baselines.

Despite recent significant strides achieved by diffusion-based Text-to-Image (T2I) models, current systems are still less capable of ensuring decent compositional generation aligned with text prompts, particularly for the multi-object generation. This work illuminates the fundamental reasons for such misalignment, pinpointing issues related to low attention activation scores and mask overlaps. While previous research efforts have individually tackled these issues, we assert that a holistic approach is paramount. Thus, we propose two novel objectives, the Separate loss and the Enhance loss, that reduce object mask overlaps and maximize attention scores, respectively. Our method diverges from conventional test-time-adaptation techniques, focusing on finetuning critical parameters, which enhances scalability and generalizability. Comprehensive evaluations demonstrate the superior performance of our model in terms of image realism, text-image alignment, and adaptability, notably outperforming prominent baselines. Ultimately, this research paves the way for T2I diffusion models with enhanced compositional capacities and broader applicability.
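To make the two objectives in the abstract concrete, here is a minimal sketch in plain Python. It is not the paper's formulation: the abstract only says that the Separate loss reduces object mask overlaps and the Enhance loss maximizes attention scores. The function names, the soft-intersection overlap measure, the "1 minus peak activation" form, and the weights `w_sep` and `w_enh` are all assumptions made for illustration. Each attention map is assumed to be a flat list of activations in [0, 1], one list per object token.

```python
from itertools import combinations


def separate_loss(maps):
    """Mean pairwise overlap between object attention maps (assumed form)."""
    pairs = list(combinations(maps, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        # Soft intersection: elementwise minimum summed over positions,
        # normalised by the smaller map's total mass (1.0 if a map is empty).
        inter = sum(min(x, y) for x, y in zip(a, b))
        denom = min(sum(a), sum(b)) or 1.0
        total += inter / denom
    return total / len(pairs)


def enhance_loss(maps):
    """Mean of (1 - peak activation) over object tokens (assumed form)."""
    return sum(1.0 - max(m) for m in maps) / len(maps)


def combined_loss(maps, w_sep=1.0, w_enh=1.0):
    """Weighted sum of the two terms; the weights are hypothetical."""
    return w_sep * separate_loss(maps) + w_enh * enhance_loss(maps)


# Toy 4-position maps for two object tokens, e.g. "cat" and "dog".
separated = [[0.9, 0.8, 0.1, 0.0], [0.0, 0.1, 0.7, 0.9]]
entangled = [[0.3, 0.3, 0.2, 0.2], [0.3, 0.2, 0.3, 0.2]]

print(combined_loss(separated) < combined_loss(entangled))
```

In this toy setup the well-separated, strongly activated maps score lower than the weak, overlapping ones, which is the behaviour the abstract describes wanting to encourage. In an actual finetuning run these terms would be computed on cross-attention maps inside the diffusion model with a differentiable framework; that part is outside the scope of this sketch.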

