CV | May 5, 2025

MCCD: Multi-Agent Collaboration-based Compositional Diffusion for Complex Text-to-Image Generation

arXiv:2505.02648v224 citations | h-index: 27 | CVPR
Originality: Incremental advance
AI Analysis

This paper addresses a performance bottleneck in text-to-image generation for complex scenes, offering an incremental improvement over existing methods.

The paper tackles the problem of generating images from complex text prompts involving multiple objects and relations, proposing MCCD which improves baseline models in a training-free manner and achieves significant performance gains.

Diffusion models have shown excellent performance in text-to-image generation. Nevertheless, existing methods often suffer from performance bottlenecks when handling complex prompts that involve multiple objects, characteristics, and relations. Therefore, we propose Multi-agent Collaboration-based Compositional Diffusion (MCCD) for text-to-image generation in complex scenes. Specifically, we design a multi-agent collaboration-based scene parsing module that generates an agent system comprising multiple agents with distinct tasks, utilizing MLLMs to extract various scene elements effectively. In addition, hierarchical compositional diffusion utilizes a Gaussian mask and filtering to refine bounding box regions and enhance objects through region enhancement, resulting in the accurate and high-fidelity generation of complex scenes. Comprehensive experiments demonstrate that our MCCD significantly improves the performance of the baseline models in a training-free manner, providing a substantial advantage in complex scene generation.
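
The abstract names a Gaussian mask and a filtering step applied to bounding box regions but does not give their formulation. As a rough illustration only, the sketch below shows one way such a mask could be built: weights peak at the box centre and fall off toward the edges, and small weights are then zeroed. The function names, the `sigma_scale` and `threshold` parameters, and the thresholding rule are all assumptions made for this example, not details taken from the paper.

```python
import math


def gaussian_box_mask(height, width, box, sigma_scale=0.5):
    """Return a height x width grid of weights in (0, 1] peaking at the box centre.

    box is (x0, y0, x1, y1) in pixel coordinates with x1 > x0 and y1 > y0.
    sigma_scale is a hypothetical knob: the standard deviation along each
    axis is sigma_scale times the box's half-extent on that axis.
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    # Guard against a zero standard deviation for degenerate boxes.
    sx = max(sigma_scale * (x1 - x0) / 2.0, 1e-6)
    sy = max(sigma_scale * (y1 - y0) / 2.0, 1e-6)
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            d = ((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2
            row.append(math.exp(-0.5 * d))
        mask.append(row)
    return mask


def filter_mask(mask, threshold=0.1):
    """Zero out weights below threshold (an assumed stand-in for 'filtering')."""
    return [[w if w >= threshold else 0.0 for w in row] for row in mask]


# Example: an 8x8 grid with a box covering (2, 2) to (6, 6).
mask = filter_mask(gaussian_box_mask(8, 8, (2, 2, 6, 6)))
```

In a setup like this, the resulting weights could be used to blend a region-specific signal with a global one so that the object's influence fades smoothly at the box boundary. How MCCD actually combines them is not stated in this summary.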

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
