CV · Jan 2, 2025

EliGen: Entity-Level Controlled Image Generation with Regional Attention

arXiv:2501.01097v3 · 29 citations · h-index: 11 · Has Code · MMAsia
Originality: Highly original
AI Analysis

EliGen addresses the limitation that global text prompts cannot control individual entities in detail, which matters for applications requiring precise visual editing.

The paper tackles fine-grained control over individual entities in text-to-image generation. It introduces EliGen, a framework that combines regional attention with a high-quality annotated dataset to enable entity-level manipulation, and it reports surpassing existing methods in both spatial precision and image quality.

Recent advancements in diffusion models have significantly advanced text-to-image generation, yet global text prompts alone remain insufficient for achieving fine-grained control over individual entities within an image. To address this limitation, we present EliGen, a novel framework for Entity-level controlled image Generation. Firstly, we put forward regional attention, a mechanism for diffusion transformers that requires no additional parameters, seamlessly integrating entity prompts and arbitrary-shaped spatial masks. By contributing a high-quality dataset with fine-grained spatial and semantic entity-level annotations, we train EliGen to achieve robust and accurate entity-level manipulation, surpassing existing methods in both spatial precision and image quality. Additionally, we propose an inpainting fusion pipeline, extending its capabilities to multi-entity image inpainting tasks. We further demonstrate its flexibility by integrating it with other open-source models such as IP-Adapter, In-Context LoRA and MLLM, unlocking new creative possibilities. The source code, model, and dataset are published at https://github.com/modelscope/DiffSynth-Studio.git.
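The abstract describes regional attention only at a high level, so the following is a minimal sketch of one way a parameter-free attention mask could tie each entity prompt to its spatial mask. It is not the authors' implementation. The token layout (global prompt, then entity prompts, then image tokens), the rules for which tokens may attend to which, and the function names `build_regional_mask` and `masked_attention` are all assumptions made for illustration.

```python
import numpy as np

def build_regional_mask(n_global, entity_lens, entity_masks):
    """Return a boolean (S, S) matrix; True means attention is allowed.

    Assumed sequence layout: [global text | entity texts ... | image tokens].
    entity_masks are binary (H, W) arrays, flattened row-major onto image tokens.
    """
    flat = [np.asarray(m, dtype=bool).ravel() for m in entity_masks]
    n_img = flat[0].size
    n_text = n_global + sum(entity_lens)
    total = n_text + n_img
    allow = np.zeros((total, total), dtype=bool)

    glob = slice(0, n_global)
    img = slice(n_text, total)
    # Assumption: the global prompt and the image tokens see each other freely.
    allow[glob, glob] = True
    allow[img, img] = True
    allow[img, glob] = True
    allow[glob, img] = True

    start = n_global
    for length, region in zip(entity_lens, flat):
        text_idx = np.arange(start, start + length)
        img_idx = n_text + np.flatnonzero(region)
        # An entity prompt attends to itself, never to other entity prompts.
        allow[np.ix_(text_idx, text_idx)] = True
        # Image tokens inside the region and the entity prompt see each other.
        allow[np.ix_(img_idx, text_idx)] = True
        allow[np.ix_(text_idx, img_idx)] = True
        start += length
    return allow

def masked_attention(q, k, v, allow):
    """Single-head scaled dot-product attention restricted by `allow`."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(allow, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy example: 2x2 latent grid, 1 global token, two entities (left/right columns).
left = [[1, 0], [1, 0]]
right = [[0, 1], [0, 1]]
allow = build_regional_mask(n_global=1, entity_lens=[2, 1], entity_masks=[left, right])
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(allow.shape[0], 4)) for _ in range(3))
out = masked_attention(q, k, v, allow)
```

Because the mask only restricts which query-key pairs participate in the softmax, the sketch adds no learnable weights, which is consistent with the "no additional parameters" property the abstract claims. Arbitrary-shaped regions need no special handling here, since any binary mask over the latent grid flattens to a set of image-token indices.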

Code Implementations: 1 repo
