CV · LG · Oct 29, 2025

Hawk: Leveraging Spatial Context for Faster Autoregressive Text-to-Image Generation

arXiv:2510.25739v1 · 2 citations · h-index: 35
Originality: Incremental advance
AI Analysis

This work addresses the inference-speed bottleneck of autoregressive image generation models. The advance is incremental, since it builds on existing speculative decoding.

The paper tackles slow inference in autoregressive text-to-image generation by introducing Hawk, a method that uses spatial context to improve speculative decoding, achieving a 1.71x speedup while maintaining image quality and diversity.

Autoregressive (AR) image generation models are capable of producing high-fidelity images but often suffer from slow inference due to their inherently sequential, token-by-token decoding process. Speculative decoding, which employs a lightweight draft model to approximate the output of a larger AR model, has shown promise in accelerating text generation without compromising quality. However, its application to image generation remains largely underexplored. The challenges stem from a significantly larger sampling space, which complicates the alignment between the draft and target model outputs, coupled with the inadequate use of the two-dimensional spatial structure inherent in images, thereby limiting the modeling of local dependencies. To overcome these challenges, we introduce Hawk, a new approach that harnesses the spatial structure of images to guide the speculative model toward more accurate and efficient predictions. Experimental results on multiple text-to-image benchmarks demonstrate a 1.71x speedup over standard AR models, while preserving both image fidelity and diversity.
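
For readers unfamiliar with the draft-and-verify idea the abstract relies on, the following is a minimal sketch of the standard speculative sampling accept/reject rule, not Hawk's spatially guided method, whose details are not given here. The function names `speculative_step` and `acceptance_rate` and the toy distributions are assumptions made for illustration. The sketch also shows why a poorly aligned draft model hurts: the expected acceptance rate is the overlap between the two distributions, which tends to shrink as the sampling space grows.

```python
import random


def acceptance_rate(p, q):
    """Expected probability that a drafted token is accepted: sum_i min(p_i, q_i)."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def speculative_step(p, q, rng):
    """One accept/reject step of standard speculative sampling (toy version).

    p: target-model probabilities over the vocabulary
    q: draft-model probabilities over the same vocabulary
    Returns (token, accepted).
    """
    # The draft model proposes a token.
    x = rng.choices(range(len(q)), weights=q)[0]
    # Accept the proposal with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    # On rejection, resample from the residual max(0, p - q);
    # rng.choices normalizes the weights internally.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    x = rng.choices(range(len(p)), weights=residual)[0]
    return x, False


if __name__ == "__main__":
    rng = random.Random(0)
    p = [0.7, 0.2, 0.1]  # hypothetical target distribution
    q = [0.3, 0.3, 0.4]  # hypothetical, poorly aligned draft distribution
    print("expected acceptance:", acceptance_rate(p, q))
    samples = [speculative_step(p, q, rng)[0] for _ in range(20000)]
    # The emitted tokens follow the target distribution p, not the draft q.
    print("empirical P(token 0):", samples.count(0) / len(samples))
```

With this rule the emitted tokens are distributed according to the target model, so quality is preserved; the speedup depends on how often drafted tokens are accepted, which is what better draft predictions aim to raise.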

