CV | Mar 13, 2025

Autoregressive Image Generation with Randomized Parallel Decoding

arXiv:2503.10568v2 | 13 citations | h-index: 3
Originality: Highly original
AI Analysis

This paper addresses slow inference and poor zero-shot generalization in autoregressive image models. It is aimed at researchers and practitioners in computer vision.

The paper tackles the inefficiency and limited generalization of raster-order autoregressive image generation by introducing ARPG, a model that supports randomized parallel decoding. ARPG reaches an FID of 1.83 on ImageNet-1K 256 with 32 sampling steps, with over a 30x inference speedup compared to representative recent autoregressive models at a similar scale.

We introduce ARPG, a novel visual autoregressive model that enables randomized parallel generation, addressing the inherent limitations of conventional raster-order approaches, which hinder inference efficiency and zero-shot generalization due to their sequential, predefined token generation order. Our key insight is that effective random-order modeling necessitates explicit guidance for determining the position of the next predicted token. To this end, we propose a novel decoupled decoding framework that decouples positional guidance from content representation, encoding them separately as queries and key-value pairs. By directly incorporating this guidance into the causal attention mechanism, our approach enables fully random-order training and generation, eliminating the need for bidirectional attention. Consequently, ARPG readily generalizes to zero-shot inference tasks such as image inpainting, outpainting, and resolution expansion. Furthermore, it supports parallel inference by concurrently processing multiple queries using a shared KV cache. On the ImageNet-1K 256 benchmark, our approach attains an FID of 1.83 with only 32 sampling steps, achieving over a 30 times speedup in inference and a 75 percent reduction in memory consumption compared to representative recent autoregressive models at a similar scale.
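The decoupled decoding idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the function names, the sinusoidal-style position embedding, the even split of tokens across steps, and the way a "token" is produced are all assumptions made for illustration. What it shows is the structure the abstract describes: queries carry only the target positions, keys and values carry already-generated content, and all queries in one step read the same shared KV cache.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys, values):
    """Every query attends over the same shared keys/values (the KV cache)."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


def random_parallel_schedule(num_tokens, num_steps, rng):
    """Shuffle positions, then split them into num_steps groups.

    Assumption: an even split; the paper may use a different schedule.
    """
    order = list(range(num_tokens))
    rng.shuffle(order)
    base, extra = divmod(num_tokens, num_steps)
    groups, start = [], 0
    for step in range(num_steps):
        size = base + (1 if step < extra else 0)
        groups.append(order[start:start + size])
        start += size
    return groups


def position_embedding(pos, dim):
    """Hypothetical position embedding used as the positional 'guidance'."""
    return [math.sin((pos + 1) * (d + 1) / dim) for d in range(dim)]


def decode(num_tokens=16, num_steps=4, dim=8, seed=0):
    rng = random.Random(seed)
    # The cache starts with one entry standing in for a class/start token.
    keys = [[1.0] * dim]
    values = [[1.0] * dim]
    generated = {}
    for group in random_parallel_schedule(num_tokens, num_steps, rng):
        # Queries encode WHERE to predict; they hold no content.
        queries = [position_embedding(p, dim) for p in group]
        outputs = cross_attend(queries, keys, values)
        for p, out in zip(group, outputs):
            # Toy stand-in for sampling a token from the model's prediction.
            pos = position_embedding(p, dim)
            generated[p] = [o + e for o, e in zip(out, pos)]
        # Extend the cache only after the whole group is predicted, so
        # tokens decoded in the same step never attend to each other.
        for p in group:
            keys.append(generated[p])
            values.append(generated[p])
    return generated, len(keys)
```

Calling `decode(16, 4)` fills 16 positions in 4 steps of 4 parallel queries each, rather than 16 sequential steps. Because the decoding order is just a shuffled list of positions, the same loop could in principle start from a cache pre-filled with known tokens, which is the intuition behind the inpainting and outpainting use cases the abstract mentions.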

Code Implementations: 1 repo
