Efficient Multi-Instance Generation with Janus-Pro-Driven Prompt Parsing
This work addresses the challenge of synthesizing complex multi-object scenes in image generation and represents a strong incremental improvement.
The paper targets a known weakness of text-guided diffusion models: complex scenes with multiple objects. It introduces Janus-Pro-driven Prompt Parsing for layout generation and MIGLoRA for parameter-efficient fine-tuning, and reports state-of-the-art performance on the COCO and LVIS benchmarks.
Recent advances in text-guided diffusion models have revolutionized conditional image generation, yet they struggle to synthesize complex scenes with multiple objects due to imprecise spatial grounding and limited scalability. We address these challenges through two key modules: 1) Janus-Pro-driven Prompt Parsing, a prompt-layout parsing module that bridges text understanding and layout generation via a compact 1B-parameter architecture, and 2) MIGLoRA, a parameter-efficient plug-in that integrates Low-Rank Adaptation (LoRA) into UNet (SD1.5) and DiT (SD3) backbones. MIGLoRA preserves the base model's parameters and ensures plug-and-play adaptability, minimizing architectural intrusion while enabling efficient fine-tuning. To support a comprehensive evaluation, we create DescripBox and DescripBox-1024, two benchmarks that span diverse scenes and resolutions. The proposed method achieves state-of-the-art performance on the COCO and LVIS benchmarks while maintaining parameter efficiency, demonstrating superior layout fidelity and scalability for open-world synthesis.
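To make the parameter-efficiency claim concrete, the following is a minimal sketch of the generic LoRA idea that MIGLoRA builds on. It is not the paper's implementation. The class name `LoRALinear`, the helper `matvec`, and the `rank`, `alpha`, and `seed` parameters are illustrative assumptions. The sketch shows a frozen weight matrix W combined with a trainable low-rank update (alpha / r) * B A, where B starts at zero so the adapted layer initially reproduces the base layer exactly.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


class LoRALinear:
    """Hypothetical illustration: frozen W plus a low-rank update (alpha / r) * B @ A."""

    def __init__(self, W, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W  # frozen base weight; never modified by the adapter
        # A gets a small random init, B is zero, so the update starts as a no-op.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)                    # output of the frozen layer
        delta = matvec(self.B, matvec(self.A, x))   # low-rank correction B (A x)
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_params(self):
        """Count only the adapter weights (A and B), not the frozen W."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


if __name__ == "__main__":
    W = [[1.0, 2.0], [3.0, 4.0]]
    layer = LoRALinear(W, rank=1)
    print(layer.forward([1.0, 1.0]))   # matches the base layer while B is zero
    print(layer.trainable_params())    # adapter size: r * (d_in + d_out)
```

For a d_out x d_in layer, the adapter adds r * (d_in + d_out) trainable weights instead of d_in * d_out, which is where the savings come from when r is much smaller than the layer dimensions. Because W is never written to, removing the adapter restores the base model unchanged, which is the plug-and-play property the abstract describes.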