CV | Dec 12, 2025

Infinity and Beyond: Compositional Alignment in VAR and Diffusion T2I Models

arXiv:2512.11542v1 | h-index: 20
Originality: Incremental advance
AI Analysis

This paper addresses the problem of generating images that accurately match their textual descriptions, which matters to users of text-to-image models, and it provides a systematic comparison and baselines for future development.

The paper tackled the challenge of compositional alignment in text-to-image models by benchmarking six systems, finding that Infinity-8B achieved the strongest overall alignment across benchmarks, while Infinity-2B matched or exceeded larger diffusion models in several categories.

Achieving compositional alignment between textual descriptions and generated images - covering objects, attributes, and spatial relationships - remains a core challenge for modern text-to-image (T2I) models. Although diffusion-based architectures have been widely studied, the compositional behavior of emerging Visual Autoregressive (VAR) models is still largely unexamined. We benchmark six diverse T2I systems - SDXL, PixArt-$α$, Flux-Dev, Flux-Schnell, Infinity-2B, and Infinity-8B - across the full T2I-CompBench++ and GenEval suites, evaluating alignment in color and attribute binding, spatial relations, numeracy, and complex multi-object prompts. Across both benchmarks, Infinity-8B achieves the strongest overall compositional alignment, while Infinity-2B also matches or exceeds larger diffusion models in several categories, highlighting favorable efficiency-performance trade-offs. In contrast, SDXL and PixArt-$α$ show persistent weaknesses in attribute-sensitive and spatial tasks. These results provide the first systematic comparison of VAR and diffusion approaches to compositional alignment and establish unified baselines for the future development of T2I models.

