LGAICCCVJan 8, 2025

On Computational Limits and Provably Efficient Criteria of Visual Autoregressive Models: A Fine-Grained Complexity Analysis

arXiv:2501.04377v220 citationsh-index: 21
Originality Highly original
AI Analysis

This work addresses the computational bottleneck for researchers and practitioners using VAR models in scalable image generation, providing theoretical limits and criteria for efficiency, though it is incremental in advancing theoretical understanding.

The paper tackles the computational inefficiency of Visual Autoregressive (VAR) Models in image generation, proving that sub-quartic time algorithms are impossible under the Strong Exponential Time Hypothesis and presenting efficient low-rank approximations to achieve sub-quadratic time under certain conditions.

Recently, Visual Autoregressive ($\mathsf{VAR}$) Models introduced a groundbreaking advancement in the field of image generation, offering a scalable approach through a coarse-to-fine ``next-scale prediction'' paradigm. Suppose that $n$ represents the height and width of the last VQ code map generated by $\mathsf{VAR}$ models, the state-of-the-art algorithm in [Tian, Jiang, Yuan, Peng and Wang, NeurIPS 2024] takes $O(n^{4+o(1)})$ time, which is computationally inefficient. In this work, we analyze the computational limits and efficiency criteria of $\mathsf{VAR}$ Models through a fine-grained complexity lens. Our key contribution is identifying the conditions under which $\mathsf{VAR}$ computations can achieve sub-quadratic time complexity. We have proved that assuming the Strong Exponential Time Hypothesis ($\mathsf{SETH}$) from fine-grained complexity theory, a sub-quartic time algorithm for $\mathsf{VAR}$ models is impossible. To substantiate our theoretical findings, we present efficient constructions leveraging low-rank approximations that align with the derived criteria. This work initiates the study of the computational efficiency of the $\mathsf{VAR}$ model from a theoretical perspective. Our technique will shed light on advancing scalable and efficient image generation in $\mathsf{VAR}$ frameworks.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes