Canvas-to-Image: Compositional Image Generation with Multimodal Controls
This work addresses users' need for precise control in image generation, though the contribution is incremental in that it builds on existing diffusion models.
The paper tackles image generation under high-fidelity compositional and multimodal control, where inputs such as text prompts, subject references, and spatial arrangements must be satisfied together. It introduces Canvas-to-Image, a unified framework that consolidates these controls into a single canvas interface. The authors report that it significantly outperforms state-of-the-art methods in identity preservation and control adherence across benchmarks.
While modern diffusion models excel at generating high-quality and diverse images, they still struggle with high-fidelity compositional and multimodal control, particularly when users simultaneously specify text prompts, subject references, spatial arrangements, pose constraints, and layout annotations. We introduce Canvas-to-Image, a unified framework that consolidates these heterogeneous controls into a single canvas interface, enabling users to generate images that faithfully reflect their intent. Our key idea is to encode diverse control signals into a single composite canvas image that the model can directly interpret for integrated visual-spatial reasoning. We further curate a suite of multi-task datasets and propose a Multi-Task Canvas Training strategy that optimizes the diffusion model to jointly understand and integrate heterogeneous controls into text-to-image generation within a unified learning paradigm. This joint training enables Canvas-to-Image to reason across multiple control modalities rather than relying on task-specific heuristics, and it generalizes well to multi-control scenarios during inference. Extensive experiments show that Canvas-to-Image significantly outperforms state-of-the-art methods in identity preservation and control adherence across challenging benchmarks, including multi-person composition, pose-controlled composition, layout-constrained generation, and multi-control generation.
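To make the "single composite canvas" idea concrete, the following is a minimal toy sketch of how heterogeneous controls might be rasterized onto one image. It is not the paper's implementation: the abstract does not specify the canvas encoding, so the `SubjectControl` and `BoxControl` types, the white background, the red box outline, and the paste-then-draw order are all assumptions made for illustration. Images are plain nested lists of RGB tuples so the example needs no external libraries.

```python
from dataclasses import dataclass
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # row-major grid of RGB pixels


@dataclass
class SubjectControl:
    """Hypothetical control: a reference crop pasted at a target location."""
    crop: Image
    top: int
    left: int


@dataclass
class BoxControl:
    """Hypothetical control: a layout box drawn as a colored outline."""
    top: int
    left: int
    height: int
    width: int
    color: Pixel = (255, 0, 0)


def blank_canvas(height: int, width: int, fill: Pixel = (255, 255, 255)) -> Image:
    """Create an empty canvas filled with a single color."""
    return [[fill for _ in range(width)] for _ in range(height)]


def paste_subject(canvas: Image, ctrl: SubjectControl) -> None:
    """Copy the reference crop onto the canvas, clipping at the borders."""
    h, w = len(canvas), len(canvas[0])
    for dy, row in enumerate(ctrl.crop):
        for dx, px in enumerate(row):
            y, x = ctrl.top + dy, ctrl.left + dx
            if 0 <= y < h and 0 <= x < w:
                canvas[y][x] = px


def draw_box(canvas: Image, ctrl: BoxControl) -> None:
    """Draw a one-pixel outline for a layout box, clipping at the borders."""
    h, w = len(canvas), len(canvas[0])
    bottom = ctrl.top + ctrl.height - 1
    right = ctrl.left + ctrl.width - 1
    for y in range(ctrl.top, bottom + 1):
        for x in range(ctrl.left, right + 1):
            on_edge = y in (ctrl.top, bottom) or x in (ctrl.left, right)
            if on_edge and 0 <= y < h and 0 <= x < w:
                canvas[y][x] = ctrl.color


def compose_canvas(height: int, width: int,
                   subjects: List[SubjectControl],
                   boxes: List[BoxControl]) -> Image:
    """Rasterize all controls onto one canvas (assumed order: subjects, then boxes)."""
    canvas = blank_canvas(height, width)
    for subject in subjects:
        paste_subject(canvas, subject)
    for box in boxes:
        draw_box(canvas, box)
    return canvas


# Usage: one 2x2 blue subject crop and one 3x3 layout box on a 6x8 canvas.
subject = SubjectControl(crop=[[(0, 0, 255)] * 2 for _ in range(2)], top=1, left=1)
box = BoxControl(top=0, left=4, height=3, width=3)
canvas = compose_canvas(6, 8, [subject], [box])
```

In a real pipeline the resulting canvas would be passed to the generator as a conditioning image alongside the text prompt. How Canvas-to-Image encodes pose or other controls on the canvas is not described in the abstract and is left out of this sketch.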