DEIG: Detail-Enhanced Instance Generation with Fine-Grained Semantic Control
DEIG addresses fine-grained semantic understanding in multi-instance generation, a known bottleneck for applications that require precise control over visual scenes, and proposes a new method for it.
The paper proposes DEIG, a framework for fine-grained semantic control in multi-instance generation that combines an Instance Detail Extractor with a Detail Fusion Module to generate visually coherent scenes from complex textual descriptions. It reports consistent gains over existing approaches in spatial consistency, semantic accuracy, and compositional generalization.
Multi-Instance Generation has advanced significantly in spatial placement and attribute binding. However, existing approaches still face challenges in fine-grained semantic understanding, particularly when dealing with complex textual descriptions. To overcome these limitations, we propose DEIG, a novel framework for fine-grained and controllable multi-instance generation. DEIG integrates an Instance Detail Extractor (IDE) that transforms text encoder embeddings into compact, instance-aware representations, and a Detail Fusion Module (DFM) that applies instance-based masked attention to prevent attribute leakage across instances. These components enable DEIG to generate visually coherent multi-instance scenes that precisely match rich, localized textual descriptions. To support fine-grained supervision, we construct a high-quality dataset with detailed, compositional instance captions generated by VLMs. We also introduce DEIG-Bench, a new benchmark with region-level annotations and multi-attribute prompts for both humans and objects. Experiments demonstrate that DEIG consistently outperforms existing approaches across multiple benchmarks in spatial consistency, semantic accuracy, and compositional generalization. Moreover, DEIG functions as a plug-and-play module, making it easily integrable into standard diffusion-based pipelines.
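To make the idea of instance-based masked attention concrete, the following is a minimal NumPy sketch of one way such a mask could work. It is not the DFM implementation: the abstract does not specify the architecture, so the bounding-box layout, the token grid, and the function names `build_instance_mask` and `masked_cross_attention` are all assumptions made for illustration. Each image token is allowed to attend only to the detail tokens of instances whose box contains it, which is one plausible way to keep attributes from leaking between instances.

```python
import numpy as np


def build_instance_mask(boxes, grid_h, grid_w, tokens_per_instance):
    """Hypothetical helper: boolean mask of shape
    (grid_h * grid_w, len(boxes) * tokens_per_instance).

    mask[q, k] is True when image token q lies inside the box of the
    instance that owns key token k. Boxes are (x0, y0, x1, y1) in grid
    cells, with x1 and y1 exclusive (an assumed convention).
    """
    n_inst = len(boxes)
    mask = np.zeros((grid_h * grid_w, n_inst * tokens_per_instance), dtype=bool)
    ys, xs = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    for i, (x0, y0, x1, y1) in enumerate(boxes):
        inside = ((xs >= x0) & (xs < x1) & (ys >= y0) & (ys < y1)).reshape(-1)
        start = i * tokens_per_instance
        mask[inside, start:start + tokens_per_instance] = True
    return mask


def masked_cross_attention(q, k, v, mask):
    """Scaled dot-product cross-attention restricted by a boolean mask.

    q: (n_query, d) image tokens; k, v: (n_key, d) instance detail tokens.
    Queries with no allowed key (background tokens) get a zero output,
    which is an assumed design choice for this sketch.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    # Rows without any allowed key would be all -inf; use zeros there so the
    # softmax stays finite, then zero those rows out after normalisation.
    has_key = mask.any(axis=1, keepdims=True)
    scores = np.where(has_key, scores, 0.0)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    weights = weights * mask
    return weights @ v, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two non-overlapping instances on a 4x4 token grid, two detail tokens each.
    boxes = [(0, 0, 2, 2), (2, 2, 4, 4)]
    mask = build_instance_mask(boxes, grid_h=4, grid_w=4, tokens_per_instance=2)
    q = rng.normal(size=(16, 8))
    k = rng.normal(size=(4, 8))
    v = rng.normal(size=(4, 8))
    out, weights = masked_cross_attention(q, k, v, mask)
    print(out.shape, weights.shape)
```

In this sketch a token inside the first box places all of its attention weight on the first instance's detail tokens and none on the second's, so a description such as a colour or texture attached to one instance cannot influence the region of another. How overlapping boxes and background tokens are handled in the actual DFM is not stated in the abstract; here overlaps simply attend to every covering instance.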