A Two-Stage System for Layout-Controlled Image Generation using Large Language Models and Diffusion Models
This work addresses the compositional limitations of diffusion models for users who need precise control over image generation, though the contribution is incremental: it builds on existing methods such as ControlNet and GLIGEN.
The paper tackles precise control over object counts and spatial arrangements in text-to-image generation. It introduces a two-stage system that uses an LLM for layout planning and a diffusion model for image synthesis. Decomposing the layout-planning task raises object recall from 57.2% to 99.9% for complex scenes.
Text-to-image diffusion models exhibit remarkable generative capabilities, but lack precise control over object counts and spatial arrangements. This work introduces a two-stage system to address these compositional limitations. The first stage employs a Large Language Model (LLM) to generate a structured layout from a list of objects. The second stage uses a layout-conditioned diffusion model to synthesize a photorealistic image adhering to this layout. We find that task decomposition is critical for LLM-based spatial planning; by simplifying the initial generation to core objects and completing the layout with rule-based insertion, we improve object recall from 57.2% to 99.9% for complex scenes. For image synthesis, we compare two leading conditioning methods: ControlNet and GLIGEN. After domain-specific finetuning on table-setting datasets, we identify a key trade-off: ControlNet preserves text-based stylistic control but suffers from object hallucination, while GLIGEN provides superior layout fidelity at the cost of reduced prompt-based controllability. Our end-to-end system successfully generates images with specified object counts and plausible spatial arrangements, demonstrating the viability of a decoupled approach for compositionally controlled synthesis.
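The decomposition described above, in which the LLM places only core objects and a rule-based step inserts the rest, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the `Box` type, the `RULES` offsets, the choice of "plate" as the core object, and the count-based definition of object recall are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in normalized image coordinates (hypothetical format)."""
    label: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


# Hypothetical placement rules: horizontal and vertical offset of each
# secondary object relative to its plate, in units of plate width and height.
RULES = {
    "fork": (-0.6, 0.0),   # to the left of the plate
    "knife": (1.1, 0.0),   # to the right of the plate
}


def complete_layout(core, requested):
    """Insert missing secondary objects next to the core 'plate' boxes."""
    layout = list(core)
    have = Counter(b.label for b in layout)
    plates = [b for b in core if b.label == "plate"]
    for label, (dx, dy) in RULES.items():
        missing = max(requested.get(label, 0) - have[label], 0)
        # Add at most one object of this kind per plate.
        for plate in plates[:missing]:
            layout.append(Box(label,
                              plate.x + dx * plate.w,
                              plate.y + dy * plate.h,
                              0.4 * plate.w,
                              plate.h))
    return layout


def object_recall(layout, requested):
    """Fraction of requested object instances present in the layout (assumed definition)."""
    have = Counter(b.label for b in layout)
    found = sum(min(have[label], n) for label, n in requested.items())
    total = sum(requested.values())
    return found / total if total else 1.0


# Stand-in for the first-stage LLM output: core objects only.
core = [Box("plate", 0.2, 0.4, 0.2, 0.2), Box("plate", 0.6, 0.4, 0.2, 0.2)]
requested = {"plate": 2, "fork": 2, "knife": 2}
full = complete_layout(core, requested)
```

In this toy setup the core-only layout covers 2 of the 6 requested objects, while the completed layout covers all 6. The sketch only shows why deterministic insertion can close a recall gap once the core objects are placed correctly; it says nothing about how the reported 57.2% and 99.9% figures were measured.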