CV · LG · Nov 15, 2024

CART: Compositional Auto-Regressive Transformer for Image Generation

arXiv:2411.10180v3 · 3 citations · h-index: 7
Originality: Highly original
AI Analysis

This paper addresses the problem of generating controllable and interpretable images, and proposes a novel method for a known bottleneck in auto-regressive image modeling.

The paper tackles the challenge of applying auto-regressive models to image generation by proposing CART, which models images as hierarchical compositions of interpretable visual layers, improving controllability, semantic interpretability, and resolution scalability.

We propose a novel Auto-Regressive (AR) image generation approach that models images as hierarchical compositions of interpretable visual layers. While AR models have achieved transformative success in language modeling, replicating this success in vision tasks remains challenging due to inherent spatial dependencies in images. Addressing the unique challenges of vision tasks, our method (CART) adds image details iteratively via semantically meaningful decompositions. We demonstrate the flexibility and generality of CART by applying it across three distinct decomposition strategies: (i) Base-Detail Decomposition (Mumford-Shah smoothness), (ii) Intrinsic Decomposition (albedo/shading), and (iii) Specularity Decomposition (diffuse/specular). This next-detail strategy outperforms traditional next-token and next-scale approaches, improving controllability, semantic interpretability, and resolution scalability. Experiments show CART generates visually compelling results while enabling structured image manipulation, opening new directions for controllable generative modeling via physically or perceptually motivated image factorization.
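The "next-detail" idea in the abstract can be illustrated with a toy base-detail decomposition: an image is split into a smooth base layer plus residual detail layers, and adding the layers back one at a time recovers progressively finer versions of the image. This is only a sketch under stated assumptions, not the paper's method: a simple box blur stands in for the Mumford-Shah smoothness prior, no transformer or tokenizer is involved, and the names `box_blur`, `base_detail_layers`, `compose`, and the `radii` parameter are hypothetical.

```python
import numpy as np

def box_blur(img, radius):
    """Smooth a 2-D array with a (2*radius+1)^2 mean filter, edge-padded.

    Assumption: this is a stand-in smoother, not Mumford-Shah.
    """
    k = 2 * radius + 1
    padded = np.pad(img, radius, mode="edge")
    out = np.zeros_like(img, dtype=float)
    h, w = img.shape
    # Sum every shifted copy of the image covered by the k x k window.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)

def base_detail_layers(img, radii=(4, 2, 1)):
    """Split img into a smooth base plus progressively finer detail layers.

    layers[0] is the base; each later layer is the residual between two
    consecutive smoothing levels, so the layers sum back to img.
    """
    img = img.astype(float)
    levels = [box_blur(img, r) for r in radii] + [img]
    layers = [levels[0]]
    for coarse, fine in zip(levels, levels[1:]):
        layers.append(fine - coarse)
    return layers

def compose(layers, upto=None):
    """Add layers back in order; upto limits how many layers are included."""
    return np.sum(layers[:upto], axis=0)

rng = np.random.default_rng(0)
image = rng.random((16, 16))          # synthetic grayscale test image
layers = base_detail_layers(image)
partial = compose(layers, upto=2)     # base + first detail layer
full = compose(layers)                # all layers, matches the input
```

Because the detail layers are telescoping residuals, composing all of them reproduces the input up to floating-point error, while stopping early yields a coarser image. In this toy setting, editing a single layer before composing is what "structured image manipulation" amounts to.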

