CV · AI · LG · Jun 12, 2025

SpectralAR: Spectral Autoregressive Visual Generation

arXiv:2506.10962v1 · 10 citations · h-index: 26
Originality: Highly original
AI Analysis

The paper addresses a fundamental limitation of autoregressive visual generation: spatial image patches have no natural causal order. It offers a more token-efficient, causal alternative for image synthesis.

The paper tackles the contradiction between parallel image patches and causal autoregressive modeling by proposing SpectralAR, which transforms images into ordered spectral tokens for coarse-to-fine generation, achieving 3.02 gFID on ImageNet-1K with 64 tokens and 310M parameters.

Autoregressive visual generation has garnered increasing attention due to its scalability and compatibility with other modalities compared with diffusion models. Most existing methods construct visual sequences as spatial patches for autoregressive generation. However, image patches are inherently parallel, contradicting the causal nature of autoregressive modeling. To address this, we propose a Spectral AutoRegressive (SpectralAR) visual generation framework, which realizes causality for visual sequences from the spectral perspective. Specifically, we first transform an image into ordered spectral tokens with Nested Spectral Tokenization, representing lower to higher frequency components. We then perform autoregressive generation in a coarse-to-fine manner with the sequences of spectral tokens. By considering different levels of detail in images, our SpectralAR achieves both sequence causality and token efficiency without bells and whistles. We conduct extensive experiments on ImageNet-1K for image reconstruction and autoregressive generation, and SpectralAR achieves 3.02 gFID with only 64 tokens and 310M parameters. Project page: https://huang-yh.github.io/spectralar/.
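
The coarse-to-fine ordering described in the abstract can be illustrated with a toy example. The sketch below is not the paper's Nested Spectral Tokenization, which is a learned tokenizer; it swaps in a plain orthonormal 1-D DCT on a single hypothetical row of pixel intensities, and all function names (`dct`, `idct`, `nested_reconstructions`, `mse`) are assumptions made for illustration. It shows only the underlying idea: when coefficients are ordered from low to high frequency, every prefix of the sequence is a valid coarse reconstruction, and each additional coefficient can only refine it.

```python
import math


def dct(signal):
    """Orthonormal DCT-II of a 1-D list of floats."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(
            x * math.cos(math.pi * (i + 0.5) * k / n)
            for i, x in enumerate(signal)
        )
        coeffs.append(scale * total)
    return coeffs


def idct(coeffs):
    """Inverse of dct() above (orthonormal DCT-III)."""
    n = len(coeffs)
    signal = []
    for i in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += scale * c * math.cos(math.pi * (i + 0.5) * k / n)
        signal.append(total)
    return signal


def nested_reconstructions(signal):
    """Reconstruct from the first 1, 2, ..., n coefficients.

    Coefficient k carries frequency k, so each prefix is a
    coarser-to-finer approximation of the same signal.
    """
    coeffs = dct(signal)
    n = len(coeffs)
    return [idct(coeffs[:k] + [0.0] * (n - k)) for k in range(1, n + 1)]


def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical row of pixel intensities in [0, 1].
row = [0.1, 0.3, 0.2, 0.8, 0.9, 0.7, 0.4, 0.2]
errors = [mse(row, r) for r in nested_reconstructions(row)]

for k, err in enumerate(errors, start=1):
    print(f"first {k} coefficient(s): mse = {err:.6f}")
```

Because the transform is orthonormal, the reconstruction error never increases as more coefficients are kept, and the one-coefficient prefix reduces to the mean intensity of the row. A spatial patch sequence has no comparable property, since dropping the last patches removes a region of the image rather than a level of detail.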

