LG AI CL CV | Feb 10, 2025

Universal Approximation of Visual Autoregressive Transformers

Yifang Chen, Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song

arXiv:2502.06167v123.313 citations | h-index: 21

Originality Highly original

AI Analysis

The paper provides foundational design principles for efficient VAR transformers, advancing image synthesis and related areas, though its extension of universal approximation theory to visual models is incremental.

The paper studies the fundamental limits of Visual Autoregressive (VAR) transformers for image generation. It proves that even a simple single-head VAR transformer, with one self-attention layer and one interpolation layer, is a universal approximator for Lipschitz image-to-image functions. As motivation, the abstract cites the reported quality advantage of VAR models over earlier methods such as Diffusion Transformers.

We investigate the fundamental limits of transformer-based foundation models, extending our analysis to Visual Autoregressive (VAR) transformers. VAR represents a major step in image generation, using a novel, scalable, coarse-to-fine "next-scale prediction" framework. These models set a new quality bar for image synthesis, outperforming all previous methods, including Diffusion Transformers. Our primary contribution establishes that a single-head VAR transformer with a single self-attention layer and a single interpolation layer is universal. From the statistical perspective, we prove that such simple VAR transformers are universal approximators for any image-to-image Lipschitz function. Furthermore, we demonstrate that flow-based autoregressive transformers inherit similar approximation capabilities. Our results provide important design principles for effective and computationally efficient VAR Transformer strategies, which can be extended to more sophisticated VAR models in image generation and related areas.
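
To make the architecture named in the abstract concrete, the following is a minimal NumPy sketch of one coarse-to-fine step: a single-head self-attention layer applied to a token map, followed by an interpolation layer that upsamples to the next scale. It is an illustration under stated assumptions, not the paper's construction: the residual connection, the scaled dot-product form, the bilinear (align-corners) interpolation, and the names `single_head_attention`, `bilinear_upsample`, and `next_scale_step` are all hypothetical choices made here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(X, Wq, Wk, Wv):
    # X: (n, d) token matrix. One attention head with a residual connection
    # (the residual is an assumption of this sketch).
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    return X + softmax(scores, axis=-1) @ V

def bilinear_upsample(img, out_h, out_w):
    # img: (h, w, d) -> (out_h, out_w, d), bilinear with aligned corners.
    # Bilinear interpolation is assumed here for illustration.
    h, w, _ = img.shape
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]  # vertical blend weights
    wx = (xs - x0)[None, :, None]  # horizontal blend weights
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def next_scale_step(tokens, Wq, Wk, Wv, out_h, out_w):
    # tokens: (h, w, d) token map at the current (coarser) scale.
    h, w, d = tokens.shape
    attended = single_head_attention(tokens.reshape(h * w, d), Wq, Wk, Wv)
    return bilinear_upsample(attended.reshape(h, w, d), out_h, out_w)

# Usage: map a 2x2 token map to a 4x4 token map.
rng = np.random.default_rng(0)
d = 3
tokens = rng.normal(size=(2, 2, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = next_scale_step(tokens, Wq, Wk, Wv, 4, 4)
print(out.shape)
```

The universality result concerns what functions a block of this general shape can approximate; the sketch only shows the data flow (attend at one scale, then interpolate to the next) and makes no claim about how the approximating weights are chosen.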
