T2I-CompBench++: An Enhanced and Comprehensive Benchmark for Compositional Text-to-image Generation
T2I-CompBench++ gives researchers and developers a comprehensive benchmark for evaluating and improving compositional text-to-image generation, though the contribution is incremental because it builds on existing benchmarks.
The paper addresses the difficulty text-to-image models have with complex compositional scenes. It introduces T2I-CompBench++, an enhanced benchmark with 8,000 prompts across categories such as attribute binding and 3D-spatial relationships, and proposes new evaluation metrics, including detection-based metrics and MLLM-based evaluation. It benchmarks 11 text-to-image models, including state-of-the-art ones such as FLUX.1 and SD3.
Despite the impressive advances in text-to-image models, they often struggle to compose complex scenes containing multiple objects with various attributes and relationships. To address this challenge, we present T2I-CompBench++, an enhanced benchmark for compositional text-to-image generation. T2I-CompBench++ comprises 8,000 compositional text prompts categorized into four primary groups: attribute binding, object relationships, generative numeracy, and complex compositions. These are further divided into eight sub-categories, including newly introduced ones such as 3D-spatial relationships and numeracy. In addition to the benchmark, we propose enhanced evaluation metrics designed to assess these diverse compositional challenges. These include a detection-based metric tailored for evaluating 3D-spatial relationships and numeracy, and an analysis of Multimodal Large Language Models (MLLMs), e.g., GPT-4V and ShareGPT4v, as evaluation metrics. Our experiments benchmark 11 text-to-image models on T2I-CompBench++, including state-of-the-art models such as FLUX.1, SD3, DALLE-3, Pixart-$\alpha$, and SD-XL. We also conduct comprehensive evaluations to validate the effectiveness of our metrics and explore the potential and limitations of MLLMs.
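To make the idea of a detection-based numeracy metric concrete, the following is a minimal sketch, not the benchmark's actual formula. It assumes an object detector has already produced a list of class labels for a generated image. The function name `numeracy_score`, the exact-match scoring rule, and the example labels are all hypothetical and introduced only for illustration.

```python
from collections import Counter


def numeracy_score(expected_counts, detected_labels):
    """Fraction of prompted object classes whose detected count matches exactly.

    Hypothetical scoring rule for illustration only; the metric used by
    T2I-CompBench++ may weight or combine counts differently.
    """
    if not expected_counts:
        raise ValueError("expected_counts must not be empty")
    # Count how many times each class label was detected in the image.
    detected = Counter(detected_labels)
    # A class scores only if the detected count equals the prompted count.
    matches = sum(
        1 for name, count in expected_counts.items() if detected[name] == count
    )
    return matches / len(expected_counts)


# Prompt: "two dogs and three cats". The detector output below is made up.
expected = {"dog": 2, "cat": 3}
labels = ["dog", "dog", "cat", "cat"]
score = numeracy_score(expected, labels)  # dogs match, cats do not
```

In this sketch the image gets credit for the dogs but not for the cats, so the score is one half. A real metric would also have to handle detector confidence thresholds and synonyms for object names, which are left out here.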