CV · Sep 7, 2024

Rethinking The Training And Evaluation of Rich-Context Layout-to-Image Generation

arXiv:2409.04847v2 · 6 citations · h-index: 18
AI Analysis

This work addresses challenges in generative modeling for layout-to-image tasks, offering incremental improvements in representation and evaluation methods.

The paper tackles the problem of layout-to-image generation by introducing a regional cross-attention module to improve handling of complex textual descriptions, and proposes new metrics for open-vocabulary evaluation, validated through a user study.

Recent advancements in generative models have significantly enhanced their capacity for image generation, enabling a wide range of applications such as image editing, completion and video editing. A specialized area within generative modeling is layout-to-image (L2I) generation, where predefined layouts of objects guide the generative process. In this study, we introduce a novel regional cross-attention module tailored to enrich layout-to-image generation. This module notably improves the representation of layout regions, particularly in scenarios where existing methods struggle with highly complex and detailed textual descriptions. Moreover, while current open-vocabulary L2I methods are trained in an open-set setting, their evaluations often occur in closed-set environments. To bridge this gap, we propose two metrics to assess L2I performance in open-vocabulary scenarios. Additionally, we conduct a comprehensive user study to validate the consistency of these metrics with human preferences.
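The abstract does not spell out how the regional cross-attention module works, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that each image token attends only to the text tokens of the layout boxes that contain it, and that tokens in overlapping boxes attend to the union of those boxes' text tokens. The function name `regional_cross_attention`, the dictionary fields `box`, `keys`, and `values`, and the zero output for uncovered tokens are all hypothetical choices made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def regional_cross_attention(queries, positions, regions):
    """Illustrative region-masked cross-attention (an assumption, not the paper's code).

    queries:   one d-dimensional query vector per image token
    positions: one (x, y) centre per image token, in normalised [0, 1] coordinates
    regions:   list of dicts with
               "box":    (x0, y0, x1, y1) in the same coordinates
               "keys":   d-dimensional key vectors for the region's text tokens
               "values": value vectors for the same text tokens
    Returns one output vector per image token. Tokens covered by no box get zeros.
    """
    d = len(queries[0])
    dv = len(regions[0]["values"][0]) if regions else 0
    outputs = []
    for q, (x, y) in zip(queries, positions):
        # Gather the text tokens of every box that contains this image token.
        keys, values = [], []
        for region in regions:
            x0, y0, x1, y1 = region["box"]
            if x0 <= x <= x1 and y0 <= y <= y1:
                keys.extend(region["keys"])
                values.extend(region["values"])
        if not keys:
            outputs.append([0.0] * dv)
            continue
        # Scaled dot-product attention restricted to the gathered tokens.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dv)]
        )
    return outputs


if __name__ == "__main__":
    # Two side-by-side boxes, each described by a single toy text token.
    regions = [
        {"box": (0.0, 0.0, 0.5, 1.0), "keys": [[1.0, 0.0]], "values": [[1.0, 0.0]]},
        {"box": (0.5, 0.0, 1.0, 1.0), "keys": [[0.0, 1.0]], "values": [[0.0, 1.0]]},
    ]
    queries = [[1.0, 1.0]] * 3
    positions = [(0.25, 0.5), (0.75, 0.5), (0.5, 0.5)]  # left, right, shared edge
    for pos, out in zip(positions, regional_cross_attention(queries, positions, regions)):
        print(pos, out)
```

In this toy setup the left token receives only the first region's value, the right token only the second, and the token on the shared edge receives an equal blend of both. A real module would operate on batched tensors inside a diffusion backbone and would likely handle overlaps and uncovered tokens differently.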

Code Implementations: 1 repo
